Abstract

We study the dynamics of supervised learning in layered neural networks,
in the regime where the size $p$ of the training set is proportional
to the number $N$ of inputs. Here the local fields are no longer described
by Gaussian probability distributions. We show how dynamical replica
theory can be used to predict the evolution of macroscopic observables,
including the relevant performance measures, incorporating the old
formalism in the limit $\alpha = p/N \to \infty$ as a special case. For simplicity we
restrict ourselves to single-layer networks and realizable tasks.
1 Introduction

In the last few years much progress has been made in the analysis of the dy- 
namics of supervised learning in layered neural networks, using the strategy of 
statistical mechanics: by deriving from the microscopic dynamical equations 
a set of closed laws describing the evolution of suitably chosen macroscopic 
observables (dynamic order parameters) in the limit of an infinite system size 
[e.g. Kinzel and Rujan (1990), Kinouchi and Caticha (1992), Biehl and Schwarze
(1992, 1995), Saad and Solla (1995)]. A recent review and more extensive guide
to the relevant references can be found in Mace and Coolen (1998a). The main 
successful procedure developed so far is built on the following cornerstones: 

• The task to be learned is defined by a (possibly noisy) 'teacher', which is 
itself a layered neural network. This induces a canonical set of dynamical 
order parameters, typically the (rescaled) overlaps between the various 
student weight vectors and the corresponding teacher weight vectors. 

• The number of network inputs is (eventually) taken to be infinitely large. 
This ensures that fluctuations in mean-field observables will vanish and 
creates the possibility of using the central limit theorem. 

• The number of 'hidden' neurons is finite. This prevents the number of 
order parameters from being infinite, and ensures that the cumulative 
impact of their fluctuations is insignificant. 

• The size of the training set is much larger than the number of updates 
made. Each example presented is now different from those that have al- 
ready been seen, such that the local fields will have Gaussian probability 
distributions, which leads to closure of the dynamic equations. 

These are not ingredients to simplify the calculations, but vital conditions, 
without which the standard method fails. Although the assumption of an 
infinite system size has been shown not to be too critical (Barber et al, 1996), 
the other assumptions do place serious restrictions on the degree of realism 
of the scenarios that can be analyzed, and have thereby, to some extent, 
prevented the theoretical results from being used by practitioners. 

In this paper we study the dynamics of learning in layered neural networks 
with restricted training sets, where the number p of examples ('questions' 
with corresponding 'answers') scales linearly with the number $N$ of inputs, i.e.
$p = \alpha N$. Here individual questions will re-appear during the learning process
as soon as the number of weight updates made is of the order of the size of
the training set. In the traditional models, where the duration of an update is
defined as $N^{-1}$, this happens as soon as $t = O(\alpha)$. At that point correlations
develop between the weights and the questions in the training set, and the 
dynamics is of a spin-glass type, with the composition of the training set 
playing the role of 'quenched disorder'. The main consequence of this is that 
the central limit theorem no longer applies to the student's local fields, which 
are now described by non-Gaussian distributions. To demonstrate this we 
trained (on-line) a perceptron with weights $J_i$ on noiseless examples generated
by a teacher perceptron with weights $B_i$, using the Hebb and AdaTron rules.
We plotted in Fig. 1 the student and teacher fields, $x = \mathbf{J}\cdot\boldsymbol{\xi}$ and $y = \mathbf{B}\cdot\boldsymbol{\xi}$
respectively, where $\boldsymbol{\xi}$ is the input vector, for $p = N/2$ examples and at time
$t = 50$. The marginal distribution $P(x)$ for $p = N/4$, at times $t = 10$ for
the Hebb rule and $t = 20$ for the AdaTron rule, is shown in Fig. 2. The
non-Gaussian student field distributions observed in Figs. 1 and 2 induce a 
deviation between the training- and generalization errors, which measure the 
network performance on training and test examples, respectively. The former 
involves averages over the non-Gaussian field distribution, whereas the latter 
(which is calculated over all possible examples) still involves Gaussian fields. 
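
The experiment described above can be sketched numerically. The following is a minimal illustration, not the authors' simulation code: it assumes the Hebb rule corresponds to updates proportional to $\mathrm{sgn}(y)$, a random normalized teacher and initial student, and a time unit of $N$ updates. The function name and all defaults are hypothetical.

```python
import numpy as np

def simulate_hebb_fields(N=500, alpha=0.5, t=10.0, eta=1.0, seed=0):
    """On-line Hebbian learning on a restricted training set of p = alpha*N
    questions; returns the student and teacher fields (x, y) on that set."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)
    xi = rng.choice([-1.0, 1.0], size=(p, N))   # fixed training questions
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)                      # teacher weights, |B| = 1 (assumption)
    J = rng.standard_normal(N)
    J /= np.linalg.norm(J)                      # initial student weights (assumption)
    y = xi @ B                                  # teacher fields, fixed during learning
    for _ in range(int(t * N)):                 # one time unit = N updates
        mu = rng.integers(p)                    # draw a question at random
        J += (eta / N) * xi[mu] * np.sign(y[mu])  # assumed Hebb rule: G[x, y] = sgn(y)
    x = xi @ J                                  # student fields after learning
    return x, y
```

A scatter plot of the returned `x` against `y` can then be compared by eye with the elliptical cloud that jointly Gaussian fields would produce.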

The appearance of non-Gaussian fields leads to a breakdown of the stan- 
dard formalism, based on deriving closed equations for a finite number of 
observables: the field distributions can no longer be characterized by a few 
moments, and the macroscopic laws must now be averaged over realizations 
of the training set. One could still try to use Gaussian distributions as
large-$\alpha$ approximations, see e.g. Sollich and Barber (1998), but it will be clear from
Figs. 1 and 2 that a systematic theory will have to give up Gaussian distri- 
butions entirely. The first rigorous study of the dynamics of learning with 
restricted training sets in non-linear networks, via the calculation of generat- 
ing functionals, was carried out by Horner (1992) for perceptrons with binary 
weights. In this paper we show how the formalism of dynamical replica theory 
(see e.g. Coolen et al, 1996) can be used successfully to predict the evolution 
of macroscopic observables for finite $\alpha$, incorporating the infinite training
set formalism as a special case, for $\alpha \to \infty$. Central to our approach is the
derivation of a diffusion equation for the joint distribution of the student
and teacher fields, which will be found to have Gaussian solutions only for
$\alpha \to \infty$. For simplicity and transparency we restrict ourselves to single-layer
systems and noise-free teachers. Application and generalization of our meth- 
ods to multi-layer systems (Saad and Coolen, 1998) and learning scenarios 
involving 'noisy' teachers (Mace and Coolen, 1998b) are presently under way. 

This paper is organized as follows. In section 2 we first derive a Fokker- 
Planck equation describing the evolution of arbitrary mean-field observables 
for $N \to \infty$. This allows us to identify the conditions for the latter to be
described by closed deterministic laws. In section 3 we choose as our observ- 
ables the joint field distribution P[x, y], in addition to (the traditional ones) 
Q and R, and show that this set obeys deterministic laws. In order to close 
these laws we use the tools of dynamical replica theory. Details of the replica 
calculation are given in section 4, to be skipped by those primarily interested 
in results. In section 5 we show how in the limit $\alpha \to \infty$ (infinite training sets)
the equations of the conventional theory are recovered. Finally we work out 
our equations explicitly for the example of on-line Hebbian learning with re- 
stricted training sets, and compare our predictions with exact results (derived 
directly from the microscopic equations) and with numerical simulations. 
Figure 1: Student and teacher fields $(x, y)$ as observed during numerical
simulations of on-line learning (learning rate $\eta = 1$) in a perceptron of size
$N = 10,000$ at $t = 50$, using 'questions' from a restricted training set of size
$p = N/2$. Left: Hebbian learning. Right: AdaTron learning. Note: in the case
of Gaussian field distributions one would have found spherically shaped plots.
Figure 2: Distribution $P(x)$ of student fields as observed during numerical
simulations of on-line learning (learning rate $\eta = 1$) in a perceptron
of size $N = 10,000$, using 'questions' from a restricted training set of size
$p = N/4$. Left: Hebbian learning, measured at $t = 10$. Right: AdaTron learning,
measured at $t = 20$. Note: not only are these distributions distinctively
non-Gaussian, they also appear to vary widely in their basic characteristics,
depending on the learning rule used.
2 From Microscopic to Macroscopic Laws
2.1 Definitions 

A student perceptron operates the following rule, which is parametrised by
the weight vector $\mathbf{J} \in \Re^N$:

$$S: \{-1,1\}^N \to \{-1,1\}, \qquad S(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{J}\cdot\boldsymbol{\xi}] \tag{2.1}$$

It tries to emulate the operation of a teacher perceptron, via an iterative
procedure for updating its parameters $\mathbf{J}$. The teacher perceptron operates a
similar rule, characterized by a given (fixed) weight vector $\mathbf{B} \in \Re^N$:

$$T: \{-1,1\}^N \to \{-1,1\}, \qquad T(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{B}\cdot\boldsymbol{\xi}] \tag{2.2}$$

In order to do so, the student perceptron modifies its weight vector $\mathbf{J}$ according
to an iterative procedure, using examples of input vectors (or 'questions')
$\boldsymbol{\xi}$, drawn at random from a fixed training set $\tilde{D} \subseteq D = \{-1,1\}^N$, and the
corresponding values of the teacher outputs $T(\boldsymbol{\xi})$, see Fig. 3.

Figure 3: Supervised learning in perceptrons. [Diagram labels: Student,
$S(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{J}\cdot\boldsymbol{\xi}]$; Teacher, $T(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{B}\cdot\boldsymbol{\xi}]$.]

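We consider the case where the training set is a randomly composed subset
$\tilde{D} \subseteq D$, of size $|\tilde{D}| = p = \alpha N$ with $\alpha > 0$:

$$\tilde{D} = \{\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^p\}, \qquad p = \alpha N, \qquad \boldsymbol{\xi}^\mu \in D \ \text{for all}\ \mu \tag{2.3}$$

We will denote averages over the training set $\tilde{D}$ and averages over the full
question set $D$ in the following way:

$$\langle \Phi(\boldsymbol{\xi}) \rangle_{\tilde{D}} = \frac{1}{|\tilde{D}|} \sum_{\boldsymbol{\xi} \in \tilde{D}} \Phi(\boldsymbol{\xi}) \qquad \text{and} \qquad \langle \Phi(\boldsymbol{\xi}) \rangle_{D} = \frac{1}{|D|} \sum_{\boldsymbol{\xi} \in D} \Phi(\boldsymbol{\xi}).$$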

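We will analyze the following two classes of learning rules:

$$\begin{array}{ll}
\text{on-line:} & \mathbf{J}(m+1) = \mathbf{J}(m) + \frac{\eta}{N}\, \boldsymbol{\xi}(m)\, \mathcal{G}[\mathbf{J}(m)\cdot\boldsymbol{\xi}(m), \mathbf{B}\cdot\boldsymbol{\xi}(m)] \\[1mm]
\text{batch:} & \mathbf{J}(m+1) = \mathbf{J}(m) + \frac{\eta}{N}\, \langle \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}(m)\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}
\end{array} \tag{2.4}$$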
In on-line learning one draws at each iteration step $m$ a question $\boldsymbol{\xi}(m) \in \tilde{D}$
at random; the dynamics is thus a stochastic process. In batch learning one
iterates a deterministic map. The function $\mathcal{G}[x, y]$ is assumed to be bounded
and not to depend on $N$, other than via its two arguments.
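
The two update rules in (2.4) can be sketched as follows. This is an illustration under stated assumptions: the source does not define the Hebb and AdaTron choices of $\mathcal{G}$ in this section, so the forms below ($\mathrm{sgn}(y)$ for Hebb, $|x|\,\mathrm{sgn}(y)\,\theta[-xy]$ for AdaTron) are the commonly used ones and are assumptions here, as are all function names.

```python
import numpy as np

def hebb(x, y):
    # Assumed Hebb rule: G[x, y] = sgn(y)
    return np.sign(y)

def adatron(x, y):
    # Assumed AdaTron rule: G[x, y] = |x| sgn(y) theta[-x y]
    return np.abs(x) * np.sign(y) * (x * y < 0)

def online_step(J, B, xi_m, eta, G):
    """One on-line update of (2.4), using the single question xi_m."""
    N = J.size
    return J + (eta / N) * xi_m * G(J @ xi_m, B @ xi_m)

def batch_step(J, B, xi_all, eta, G):
    """One batch update of (2.4), averaging over all rows of xi_all."""
    N = J.size
    g = G(xi_all @ J, xi_all @ B)                 # one value per question
    return J + (eta / N) * (xi_all * g[:, None]).mean(axis=0)
```

For a training set consisting of a single question the two steps coincide, which makes a convenient sanity check.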

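Our most important observables during learning are the training error
$E_t(\mathbf{J})$ and the generalization error $E_g(\mathbf{J})$, defined as follows:

$$E_t(\mathbf{J}) = \langle \theta[-(\mathbf{J}\cdot\boldsymbol{\xi})(\mathbf{B}\cdot\boldsymbol{\xi})] \rangle_{\tilde{D}}, \tag{2.5}$$

$$E_g(\mathbf{J}) = \langle \theta[-(\mathbf{J}\cdot\boldsymbol{\xi})(\mathbf{B}\cdot\boldsymbol{\xi})] \rangle_{D}. \tag{2.6}$$

Only if the training set $\tilde{D}$ is sufficiently large, and if there are no correlations
between $\mathbf{J}$ and the questions $\boldsymbol{\xi} \in \tilde{D}$, will these two errors be identical.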
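
The two error measures (2.5) and (2.6) can be estimated numerically as below. This is a sketch: the training error is an exact average over the given training set, whereas the average over the full question set $D$ is replaced by a Monte Carlo estimate over randomly drawn binary questions, which is an assumption of this example (as are the function names).

```python
import numpy as np

def training_error(J, B, xi_train):
    """E_t: fraction of training questions on which student and teacher disagree."""
    return np.mean((xi_train @ J) * (xi_train @ B) < 0)

def generalization_error_mc(J, B, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_g over random questions from {-1, 1}^N."""
    rng = np.random.default_rng(seed)
    xi = rng.choice([-1.0, 1.0], size=(n_samples, J.size))
    return np.mean((xi @ J) * (xi @ B) < 0)
```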



2.2 From Discrete to Continuous Time 



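We next convert the dynamical laws (2.4) into the language of stochastic
processes. We introduce the probability $p_m(\mathbf{J})$ to find weight vector $\mathbf{J}$ at
discrete iteration step $m$. In terms of this microscopic probability distribution
the processes (2.4) can be written in the general Markovian form

$$p_{m+1}(\mathbf{J}) = \int d\mathbf{J}'\; W[\mathbf{J}; \mathbf{J}']\; p_m(\mathbf{J}'), \tag{2.7}$$

with the transition probabilities

$$\begin{array}{ll}
\text{on-line:} & W[\mathbf{J}; \mathbf{J}'] = \left\langle \delta\left[\mathbf{J} - \mathbf{J}' - \frac{\eta}{N}\, \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}'\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}]\right] \right\rangle_{\tilde{D}} \\[2mm]
\text{batch:} & W[\mathbf{J}; \mathbf{J}'] = \delta\left[\mathbf{J} - \mathbf{J}' - \frac{\eta}{N}\, \langle \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}'\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\right]
\end{array} \tag{2.8}$$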

We now make the transition to a description involving real-valued time labels
by choosing the duration of each iteration step to be a real-valued random
number, such that the probability that at time $t$ precisely $m$ steps have been
made is given by the Poisson expression

$$\pi_m(t) = \frac{1}{m!} (Nt)^m e^{-Nt}. \tag{2.9}$$

For times $t \gg N^{-1}$ we find $t = m/N + O(N^{-1/2})$, the usual time unit. Due to
the random durations of the iteration steps we have to switch to the following
microscopic probability distribution:

$$p_t(\mathbf{J}) = \sum_{m \geq 0} \pi_m(t)\; p_m(\mathbf{J}). \tag{2.10}$$
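
The statement $t = m/N + O(N^{-1/2})$ can be checked by sampling the number of steps $m$ from the Poisson law (2.9). The helper below is a hypothetical illustration: for Poisson statistics with mean $Nt$, the ratio $m/N$ has mean $t$ and standard deviation $\sqrt{t/N}$.

```python
import numpy as np

def sample_rescaled_steps(N, t, n_runs=10000, seed=0):
    """Draw m ~ Poisson(N t) as in (2.9) and return the rescaled values m / N."""
    rng = np.random.default_rng(seed)
    m = rng.poisson(N * t, size=n_runs)
    return m / N
```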

This distribution obeys a simple differential equation, which immediately
follows from the pleasant properties of (2.9) under temporal differentiation:

$$\frac{d}{dt}\, p_t(\mathbf{J}) = N \int d\mathbf{J}'\; \{W[\mathbf{J}; \mathbf{J}'] - \delta[\mathbf{J} - \mathbf{J}']\}\; p_t(\mathbf{J}'). \tag{2.11}$$

So far no approximations have been made; equation (2.11) is exact for any
$N$. It is the equivalent of the master equation often introduced to define the
dynamics of spin systems.
2.3 Derivation of Macroscopic Fokker-Planck Equation



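We now wish to investigate the dynamics of a number of as yet arbitrary
macroscopic observables $\boldsymbol{\Omega}[\mathbf{J}] = (\Omega_1[\mathbf{J}], \ldots, \Omega_k[\mathbf{J}])$. To do so we introduce a
macroscopic probability distribution

$$P_t(\boldsymbol{\Omega}) = \int d\mathbf{J}\; p_t(\mathbf{J})\; \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]]. \tag{2.12}$$

Its time derivative immediately follows from that in (2.11):

$$\frac{d}{dt}\, P_t(\boldsymbol{\Omega}) = \int d\boldsymbol{\Omega}'\; W_t[\boldsymbol{\Omega}; \boldsymbol{\Omega}']\; P_t(\boldsymbol{\Omega}'), \tag{2.13}$$

where

$$W_t[\boldsymbol{\Omega}; \boldsymbol{\Omega}'] = \frac{\int d\mathbf{J}'\; p_t(\mathbf{J}')\, \delta[\boldsymbol{\Omega}' - \boldsymbol{\Omega}[\mathbf{J}']] \int d\mathbf{J}\; \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]]\; N \{W[\mathbf{J}; \mathbf{J}'] - \delta[\mathbf{J} - \mathbf{J}']\}}{\int d\mathbf{J}'\; p_t(\mathbf{J}')\, \delta[\boldsymbol{\Omega}' - \boldsymbol{\Omega}[\mathbf{J}']]}.$$

If we insert the relevant expressions (2.8) for $W[\mathbf{J}; \mathbf{J}']$ we can perform the
$\mathbf{J}$-integrations, and obtain expressions in terms of so-called sub-shell averages,
defined as

$$\langle f(\mathbf{J}) \rangle_{\boldsymbol{\Omega}; t} = \frac{\int d\mathbf{J}\; p_t(\mathbf{J})\, \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]]\, f(\mathbf{J})}{\int d\mathbf{J}\; p_t(\mathbf{J})\, \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]]}.$$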

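For the two types of learning rules at hand we obtain:

$$W_t^{\rm onl}[\boldsymbol{\Omega}; \boldsymbol{\Omega}'] = N \left\langle \left\langle \delta\left[\boldsymbol{\Omega} - \boldsymbol{\Omega}\left[\mathbf{J} + \tfrac{\eta}{N}\, \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}]\right]\right] \right\rangle_{\tilde{D}} - \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]] \right\rangle_{\boldsymbol{\Omega}'; t}$$

$$W_t^{\rm bat}[\boldsymbol{\Omega}; \boldsymbol{\Omega}'] = N \left\langle \delta\left[\boldsymbol{\Omega} - \boldsymbol{\Omega}\left[\mathbf{J} + \tfrac{\eta}{N}\, \langle \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\right]\right] - \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]] \right\rangle_{\boldsymbol{\Omega}'; t}$$

We now insert integral representations for the $\delta$-distributions. This gives for
our two learning scenarios:

$$W_t^{\rm onl}[\boldsymbol{\Omega}; \boldsymbol{\Omega}'] = \int \frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\; e^{i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\; N \left\langle \left\langle e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J} + \frac{\eta}{N} \boldsymbol{\xi} \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}]]} \right\rangle_{\tilde{D}} - e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J}]} \right\rangle_{\boldsymbol{\Omega}'; t} \tag{2.14}$$

$$W_t^{\rm bat}[\boldsymbol{\Omega}; \boldsymbol{\Omega}'] = \int \frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\; e^{i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\; N \left\langle e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J} + \frac{\eta}{N} \langle \boldsymbol{\xi} \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}]} - e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J}]} \right\rangle_{\boldsymbol{\Omega}'; t} \tag{2.15}$$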

Still no approximations have been made. The above two expressions differ
only in the stage where the averaging over the training set is carried out.

In expanding equations (2.14, 2.15) for large $N$ and finite $t$ we have to be
careful, since the system size $N$ enters both as a small parameter that controls
the magnitude of the modification of individual components of the weight
vector, and as the quantity that determines the dimensions and lengths of the
various vectors that occur. We therefore inspect more closely the usual Taylor expansions:

$$F[\mathbf{J} + \mathbf{k}] = \sum_{\ell \geq 0} \frac{1}{\ell!} \sum_{i_1 = 1}^{N} \cdots \sum_{i_\ell = 1}^{N} k_{i_1} \cdots k_{i_\ell}\; \frac{\partial^\ell F[\mathbf{J}]}{\partial J_{i_1} \cdots \partial J_{i_\ell}}$$

If all derivatives were to be treated as $O(1)$ (i.e. if we only take into account
the scaling of the shift $\mathbf{k}$ with $N$), problems could arise, since in the cases
of interest (where $\mathbf{k}^2 = O(N^{-1})$) this series could diverge as $F[\mathbf{J} + \mathbf{k}] =
\sum_{\ell \geq 0} (\sum_i k_i)^\ell = \sum_{\ell \geq 0} O(1)$. If we assess how derivatives with respect to
individual components $J_i$ scale for the standard types of mean-field observables,
we find the following scaling property which we will choose as our definition
of mean-field observables:

$$\frac{\partial^\ell F[\mathbf{J}]}{\partial J_{i_1} \cdots \partial J_{i_\ell}} = O\left( \frac{F[\mathbf{J}]}{|\mathbf{J}|^\ell}\; N^{\frac{1}{2}\ell - d} \right) \qquad (N \to \infty) \tag{2.16}$$



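in which $d$ is the number of different elements in the set $\{i_1, \ldots, i_\ell\}$. If $F[\mathbf{J}]$
is a mean-field observable in the sense of (2.16), we can estimate the scaling
of the various terms in the Taylor expansion:

$$F[\mathbf{J} + \mathbf{k}] = F[\mathbf{J}] + \sum_i k_i \frac{\partial F[\mathbf{J}]}{\partial J_i} + \frac{1}{2} \sum_{ij} k_i k_j \frac{\partial^2 F[\mathbf{J}]}{\partial J_i \partial J_j} + \sum_{\ell \geq 3} O\left( F[\mathbf{J}] \left[ \frac{|\mathbf{k}|}{|\mathbf{J}|} \right]^\ell \right) \tag{2.17}$$

(in the last step we have used $\sum_i k_i = O(\sqrt{N} |\mathbf{k}|)$).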

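We apply (2.17) to our macroscopic equations (2.14, 2.15), restricting ourselves
from now on to mean-field observables $\boldsymbol{\Omega}[\mathbf{J}] = O(N^0)$ in the sense
of (2.16), one of which we choose to be $\mathbf{J}^2$. Here the shifts $\mathbf{k}$, being either
$\frac{\eta}{N}\, \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}]$ or $\frac{\eta}{N}\, \langle \boldsymbol{\xi}\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}$, scale as $|\mathbf{k}| = O(N^{-1/2})$.
Consequently the $\ell$-th order term in the expansions of both (2.14) and (2.15) will
be of order $N^{-\ell/2}$:

$$e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J} + \mathbf{k}]} - e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J}]} = \left\{ \sum_i k_i \frac{\partial}{\partial J_i} + \frac{1}{2} \sum_{ij} k_i k_j \frac{\partial^2}{\partial J_i \partial J_j} \right\} e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J}]} + O(N^{-3/2})$$

This, in turn, gives

$$\int \frac{d\hat{\boldsymbol{\Omega}}}{(2\pi)^k}\; e^{i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}}\; N \left[ e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J} + \mathbf{k}]} - e^{-i \hat{\boldsymbol{\Omega}}\cdot\boldsymbol{\Omega}[\mathbf{J}]} \right]
= N \left\{ - \sum_\mu \frac{\partial}{\partial \Omega_\mu} \left[ \sum_i k_i \frac{\partial \Omega_\mu[\mathbf{J}]}{\partial J_i} + \frac{1}{2} \sum_{ij} k_i k_j \frac{\partial^2 \Omega_\mu[\mathbf{J}]}{\partial J_i \partial J_j} \right] + \frac{1}{2} \sum_{\mu\nu} \frac{\partial^2}{\partial \Omega_\mu \partial \Omega_\nu} \sum_{ij} k_i k_j \frac{\partial \Omega_\mu[\mathbf{J}]}{\partial J_i} \frac{\partial \Omega_\nu[\mathbf{J}]}{\partial J_j} \right\} \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}[\mathbf{J}]] + O(N^{-1/2})$$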



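It is now evident, in view of (2.14, 2.15), that both types of dynamics are
described by macroscopic laws with transition probability densities of the
general form

$$W_t[\boldsymbol{\Omega}; \boldsymbol{\Omega}'] = \left\{ - \sum_\mu F_\mu[\boldsymbol{\Omega}'; t] \frac{\partial}{\partial \Omega_\mu} + \frac{1}{2} \sum_{\mu\nu} G_{\mu\nu}[\boldsymbol{\Omega}'; t] \frac{\partial^2}{\partial \Omega_\mu \partial \Omega_\nu} \right\} \delta[\boldsymbol{\Omega} - \boldsymbol{\Omega}'] + O(N^{-1/2})$$

which, due to (2.13) and for $N \to \infty$ and finite times, leads to a Fokker-Planck
equation:

$$\frac{d}{dt}\, P_t(\boldsymbol{\Omega}) = - \sum_\mu \frac{\partial}{\partial \Omega_\mu} \{ F_\mu[\boldsymbol{\Omega}; t]\, P_t(\boldsymbol{\Omega}) \} + \frac{1}{2} \sum_{\mu\nu} \frac{\partial^2}{\partial \Omega_\mu \partial \Omega_\nu} \{ G_{\mu\nu}[\boldsymbol{\Omega}; t]\, P_t(\boldsymbol{\Omega}) \} \tag{2.18}$$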

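The differences between the two types of dynamics are in the explicit
expressions for the flow- and diffusion terms:

$$F_\mu^{\rm onl}[\boldsymbol{\Omega}; t] = \eta \left\langle \sum_i \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \frac{\partial \Omega_\mu[\mathbf{J}]}{\partial J_i} \right\rangle_{\boldsymbol{\Omega}; t} + \frac{\eta^2}{2N} \left\langle \sum_{ij} \langle \xi_i \xi_j\, \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \frac{\partial^2 \Omega_\mu[\mathbf{J}]}{\partial J_i \partial J_j} \right\rangle_{\boldsymbol{\Omega}; t}$$

$$G_{\mu\nu}^{\rm onl}[\boldsymbol{\Omega}; t] = \frac{\eta^2}{N} \left\langle \sum_{ij} \langle \xi_i \xi_j\, \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \frac{\partial \Omega_\mu[\mathbf{J}]}{\partial J_i} \frac{\partial \Omega_\nu[\mathbf{J}]}{\partial J_j} \right\rangle_{\boldsymbol{\Omega}; t}$$

$$F_\mu^{\rm bat}[\boldsymbol{\Omega}; t] = \eta \left\langle \sum_i \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \frac{\partial \Omega_\mu[\mathbf{J}]}{\partial J_i} \right\rangle_{\boldsymbol{\Omega}; t} + \frac{\eta^2}{2N} \left\langle \sum_{ij} \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \langle \xi_j\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \frac{\partial^2 \Omega_\mu[\mathbf{J}]}{\partial J_i \partial J_j} \right\rangle_{\boldsymbol{\Omega}; t}$$

$$G_{\mu\nu}^{\rm bat}[\boldsymbol{\Omega}; t] = \frac{\eta^2}{N} \left\langle \sum_{ij} \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \langle \xi_j\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \frac{\partial \Omega_\mu[\mathbf{J}]}{\partial J_i} \frac{\partial \Omega_\nu[\mathbf{J}]}{\partial J_j} \right\rangle_{\boldsymbol{\Omega}; t}$$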



Equation (2.18) allows us to define the goal of our exercise in more explicit
form. If we wish to arrive at closed deterministic macroscopic equations, we
have to choose our observables such that

1. $\lim_{N \to \infty} G_{\mu\nu}[\boldsymbol{\Omega}; t] = 0$ (this ensures determinism)

2. $\lim_{N \to \infty} \frac{\partial}{\partial t} F_\mu[\boldsymbol{\Omega}; t] = 0$ (this ensures closure)

In the case of time-dependent global parameters, such as learning rates or
decay rates, the latter condition relaxes to the requirement that any explicit
time-dependence of $F_\mu[\boldsymbol{\Omega}; t]$ is restricted to these global parameters.
3 Application to Canonical Observables
3.1 Choice of Canonical Observables 

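We now apply the general results obtained so far to a specific set of observables
$\boldsymbol{\Omega}[\mathbf{J}]$, tailored to the problem at hand:

$$Q[\mathbf{J}] = \mathbf{J}^2, \qquad R[\mathbf{J}] = \mathbf{J}\cdot\mathbf{B}, \qquad P[x, y; \mathbf{J}] = \langle \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}]\; \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \tag{3.1}$$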

This choice is motivated by the following considerations: (i) in order to
incorporate the standard theory in the limit $\alpha \to \infty$ we need at least $Q[\mathbf{J}]$ and
$R[\mathbf{J}]$, (ii) we need to be able to calculate the training error, which involves
field statistics calculated over the training set $\tilde{D}$, as described by $P[x, y; \mathbf{J}]$,
and (iii) for finite $\alpha$ one cannot expect closed macroscopic equations for just
a finite number of order parameters; the present choice (involving the order
parameter function $P[x, y; \mathbf{J}]$) represents effectively an infinite number. In
subsequent calculations we will, however, assume the number of arguments
$(x, y)$ for which $P[x, y; \mathbf{J}]$ is to be evaluated (and thus our number of order
parameters) to go to infinity only after the limit $N \to \infty$ has been taken.
This will eliminate many technical subtleties and will allow us to use the
Fokker-Planck equation (2.18).
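
For a given weight vector, the observables (3.1) can be measured directly. The sketch below evaluates $Q$ and $R$ exactly and approximates the joint field distribution $P[x, y; \mathbf{J}]$ by a normalized two-dimensional histogram; the use of a histogram, the bin count and the function name are assumptions made for illustration only.

```python
import numpy as np

def canonical_observables(J, B, xi_train, bins=20):
    """Return Q = J^2, R = J.B and a histogram estimate of P[x, y; J]."""
    Q = J @ J
    R = J @ B
    x = xi_train @ J                     # student fields on the training set
    y = xi_train @ B                     # teacher fields on the training set
    P, x_edges, y_edges = np.histogram2d(x, y, bins=bins, density=True)
    return Q, R, P, x_edges, y_edges
```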

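The observables (3.1) are indeed of the mean-field type in the sense of
(2.16). Insertion into (2.16) immediately shows this to be true for the scalar
observables $Q[\mathbf{J}]$ and $R[\mathbf{J}]$. Checking (2.16) for the function $P[x, y; \mathbf{J}]$ is less
trivial. Here we have to use the property that $\tilde{D}$ has been composed in a
random manner. We denote with $I$ the set of all different indices in the list
$(i_1, \ldots, i_\ell)$, with $n_k$ giving the number of times a number $k$ occurs, and with
$I^\pm \subseteq I$ defined as the set of all indices $k \in I$ for which $n_k$ is even ($+$), or odd
($-$). Note that with these definitions $\ell = \sum_{k \in I^+} n_k + \sum_{k \in I^-} n_k \geq 2|I^+| + |I^-|$.
We then obtain the following scaling identity:

$$\langle \xi_{i_1} \cdots \xi_{i_\ell}\; \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}]\; \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}
= \int \frac{d\hat{x}\, d\hat{y}}{(2\pi)^2}\; e^{i[x\hat{x} + y\hat{y}]} \left\langle \prod_{k \in I} \xi_k^{n_k} e^{-i \xi_k [\hat{x} J_k + \hat{y} B_k]} \prod_{k \notin I} e^{-i \xi_k [\hat{x} J_k + \hat{y} B_k]} \right\rangle_{\tilde{D}}
= O\left(N^{-\frac{1}{2}|I^-|}\right) \tag{3.2}$$

This gives for derivatives of $P[x, y; \mathbf{J}]$:

$$\frac{\partial^\ell P[x, y; \mathbf{J}]}{\partial J_{i_1} \cdots \partial J_{i_\ell}} = (-1)^\ell \frac{\partial^\ell}{\partial x^\ell} \langle \xi_{i_1} \cdots \xi_{i_\ell}\; \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}]\; \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} = O\left(N^{-\frac{1}{2}|I^-|}\right)$$

Since $\frac{1}{2}\ell - |I| + \frac{1}{2}|I^-| = \frac{1}{2}[\ell - |I^-| - 2|I^+|] \geq 0$, the function $P[x, y; \mathbf{J}]$ is indeed
found to be a mean-field observable.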



Footnote: A simple rule of thumb is the following: if a process requires replica theory for its
stationary state analysis, as does learning with restricted training sets, its dynamics is of
a spin-glass type and cannot be described by a finite set of closed dynamic equations.
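3.2 Deterministic Dynamical Laws

We next show that for the observables (3.1) the diffusion matrix elements $G_{\mu\nu}$
in the Fokker-Planck equation (2.18) vanish for $N \to \infty$. Our observables will
consequently obey deterministic dynamical laws. The diffusion terms
associated with $Q[\mathbf{J}]$ and $R[\mathbf{J}]$ are trivial. For on-line learning we find:

$$\begin{pmatrix} G^{\rm onl}_{QQ}[\ldots] \\ G^{\rm onl}_{QR}[\ldots] \\ G^{\rm onl}_{RR}[\ldots] \end{pmatrix}
= \frac{\eta^2}{N} \left\langle \sum_{ij} \langle \xi_i \xi_j\, \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \begin{pmatrix} (2J_i)(2J_j) \\ (2J_i)(B_j) \\ (B_i)(B_j) \end{pmatrix} \right\rangle_{QRP; t}
= \frac{\eta^2}{N} \int dx\, dy\; P[x, y]\; \mathcal{G}^2[x, y] \begin{pmatrix} 4x^2 \\ 2xy \\ y^2 \end{pmatrix}
= O\left(\frac{1}{N}\right)$$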

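where the notation $\langle \cdots \rangle_{QRP; t}$ refers to sub-shell averages with respect to the
order parameters $Q$, $R$ and $\{P[x, y]\}$ at time $t$. For batch learning, similarly:

$$\begin{pmatrix} G^{\rm bat}_{QQ}[\ldots] \\ G^{\rm bat}_{QR}[\ldots] \\ G^{\rm bat}_{RR}[\ldots] \end{pmatrix}
= \frac{\eta^2}{N} \left\langle \sum_{ij} \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \langle \xi_j\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \begin{pmatrix} (2J_i)(2J_j) \\ (2J_i)(B_j) \\ (B_i)(B_j) \end{pmatrix} \right\rangle_{QRP; t}
= \frac{\eta^2}{N} \begin{pmatrix} 4 \left\{ \int dx\, dy\; P[x, y]\, x\, \mathcal{G}[x, y] \right\}^2 \\ 2 \int dx\, dy\; P[x, y]\, x\, \mathcal{G}[x, y] \int dx\, dy\; P[x, y]\, y\, \mathcal{G}[x, y] \\ \left\{ \int dx\, dy\; P[x, y]\, y\, \mathcal{G}[x, y] \right\}^2 \end{pmatrix}
= O\left(\frac{1}{N}\right)$$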

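In calculating diffusion terms which involve the order parameter function
$P[x, y; \mathbf{J}]$ we will again need the scaling property (3.2). First we turn to
on-line learning diffusion terms with just one occurrence of $P[x, y; \mathbf{J}]$:

$$\begin{pmatrix} G^{\rm onl}_{Q, P[x,y]}[\ldots] \\ G^{\rm onl}_{R, P[x,y]}[\ldots] \end{pmatrix}
= - \frac{\eta^2}{N} \frac{\partial}{\partial x} \left\langle \sum_{ij} \langle \xi_i \xi_j\, \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \begin{pmatrix} 2J_i\, \langle \xi'_j\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \\ B_i\, \langle \xi'_j\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \end{pmatrix} \right\rangle_{QRP; t}$$

$$= - \frac{\eta^2}{N} \int dx'\, dy'\; P[x', y']\; \mathcal{G}^2[x', y']\; \frac{\partial}{\partial x} \left\{ \begin{pmatrix} 2x \\ y \end{pmatrix} P[x, y] \right\} + O\left(\frac{1}{N}\right) = O\left(\frac{1}{N}\right)$$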



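For batch learning we find a similar result:

$$\begin{pmatrix} G^{\rm bat}_{Q, P[x,y]}[\ldots] \\ G^{\rm bat}_{R, P[x,y]}[\ldots] \end{pmatrix}
= - \frac{\eta^2}{N} \frac{\partial}{\partial x} \left\langle \sum_{ij} \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \langle \xi_j\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \begin{pmatrix} 2J_i\, \langle \xi'_j\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \\ B_i\, \langle \xi'_j\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \end{pmatrix} \right\rangle_{QRP; t}$$

$$= - \frac{\eta^2}{N} \int dx'\, dy'\; P[x', y']\; \mathcal{G}[x', y'] \begin{pmatrix} 2x' \\ y' \end{pmatrix} \frac{\partial}{\partial x} \left\langle \sum_j \langle \xi_j\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \langle \xi'_j\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \right\rangle_{QRP; t} = O\left(\frac{1}{N}\right)$$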
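The non-trivial terms are those where two derivatives of the order parameter
function $P[x, y; \mathbf{J}]$ come into play. These are dealt with by separating $i = j$
from $i \neq j$ terms, in combination with (3.2):

$$G^{\rm onl}_{P[x,y], P[x',y']}[\ldots] = \frac{\eta^2}{N} \frac{\partial^2}{\partial x\, \partial x'} \left\langle \sum_{ij} \langle \xi_i \xi_j\, \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\; \langle \xi'_i\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}}\; \langle \xi''_j\, \delta[x' - \mathbf{J}\cdot\boldsymbol{\xi}'']\, \delta[y' - \mathbf{B}\cdot\boldsymbol{\xi}''] \rangle_{\tilde{D}} \right\rangle_{QRP; t}$$

Similarly:

$$G^{\rm bat}_{P[x,y], P[x',y']}[\ldots] = \frac{\eta^2}{N} \frac{\partial^2}{\partial x\, \partial x'} \left\langle \sum_{ij} \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \langle \xi_j\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\; \langle \xi'_i\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}}\; \langle \xi''_j\, \delta[x' - \mathbf{J}\cdot\boldsymbol{\xi}'']\, \delta[y' - \mathbf{B}\cdot\boldsymbol{\xi}''] \rangle_{\tilde{D}} \right\rangle_{QRP; t}$$

All diffusion terms vanish in the limit $N \to \infty$. The Fokker-Planck equation
(2.18) reduces to the Liouville equation $\frac{d}{dt} P_t(\boldsymbol{\Omega}) = - \sum_\mu \frac{\partial}{\partial \Omega_\mu} \{ F_\mu[\boldsymbol{\Omega}; t]\, P_t(\boldsymbol{\Omega}) \}$,
describing deterministic evolution for our macroscopic observables: $\frac{d}{dt} \boldsymbol{\Omega} =
\mathbf{F}[\boldsymbol{\Omega}; t]$. These deterministic equations we will now work out explicitly.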

On-Line Learning 

First we deal with the scalar observables Q and R: 

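$$\frac{d}{dt} Q = \lim_{N \to \infty} \left\{ 2\eta \left\langle \langle (\mathbf{J}\cdot\boldsymbol{\xi})\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \right\rangle_{QRP; t} + \eta^2 \left\langle \langle \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \right\rangle_{QRP; t} \right\}
= 2\eta \int dx\, dy\; P[x, y]\, x\, \mathcal{G}[x, y] + \eta^2 \int dx\, dy\; P[x, y]\, \mathcal{G}^2[x, y] \tag{3.3}$$

$$\frac{d}{dt} R = \lim_{N \to \infty} \eta \left\langle \langle (\mathbf{B}\cdot\boldsymbol{\xi})\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \right\rangle_{QRP; t} = \eta \int dx\, dy\; P[x, y]\, y\, \mathcal{G}[x, y] \tag{3.4}$$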

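The equations (3.3, 3.4) are identical to those found in the $\alpha \to \infty$ formalism.
The difference is in the function to be substituted for $P[x, y]$, which here is
the solution of

$$\frac{\partial}{\partial t} P[x, y] = \lim_{N \to \infty} \left\{ - \eta \frac{\partial}{\partial x} \left\langle \sum_i \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\; \langle \xi'_i\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \right\rangle_{QRP; t}
+ \frac{\eta^2}{2N} \frac{\partial^2}{\partial x^2} \left\langle \sum_{ij} \langle \xi_i \xi_j\, \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\; \langle \xi'_i \xi'_j\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \right\rangle_{QRP; t} \right\}$$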
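According to (3.2) the off-diagonal terms $i \neq j$ in the contribution with the
second derivative together contribute only vanishing orders ($O(N^{-1})$), so that
we need only consider the diagonal $i = j$ ones:

$$\frac{\partial}{\partial t} P[x, y] = \lim_{N \to \infty} \left\{ - \eta \frac{\partial}{\partial x} \left\langle \left\langle \left\langle \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}]\, (\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}} \right\rangle_{QRP; t}
+ \frac{1}{2} \eta^2 \frac{\partial^2}{\partial x^2} \left\langle \langle \mathcal{G}^2[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}\; \langle \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \rangle_{\tilde{D}} \right\rangle_{QRP; t} \right\}$$

$$= - \eta \frac{\partial}{\partial x} \int dx'\, dy'\; \mathcal{G}[x', y']\; C[x, y; x', y'] + \frac{1}{2} \eta^2 \int dx'\, dy'\; P[x', y']\, \mathcal{G}^2[x', y']\; \frac{\partial^2}{\partial x^2} P[x, y] \tag{3.5}$$

with the function

$$C[x, y; x', y'] = \lim_{N \to \infty} \left\langle \left\langle \left\langle \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}]\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}]\, (\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\, \delta[x' - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y' - \mathbf{B}\cdot\boldsymbol{\xi}'] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}} \right\rangle_{QRP; t} \tag{3.6}$$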

Batch Learning 

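Here we can again use the scaling relation (3.2) to eliminate terms. For $Q$
and $R$ one finds

$$\frac{d}{dt} Q = \lim_{N \to \infty} \left\{ 2\eta \left\langle \langle (\mathbf{J}\cdot\boldsymbol{\xi})\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \right\rangle_{QRP; t} + \frac{\eta^2}{N} \left\langle \sum_i \langle \xi_i\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}}^2 \right\rangle_{QRP; t} \right\}
= 2\eta \int dx\, dy\; P[x, y]\, x\, \mathcal{G}[x, y] \tag{3.7}$$

$$\frac{d}{dt} R = \lim_{N \to \infty} \eta \left\langle \langle (\mathbf{B}\cdot\boldsymbol{\xi})\, \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}] \rangle_{\tilde{D}} \right\rangle_{QRP; t}
= \eta \int dx\, dy\; P[x, y]\, y\, \mathcal{G}[x, y] \tag{3.8}$$

Finally we calculate the temporal derivative of the joint field distribution:

$$\frac{\partial}{\partial t} P[x, y] = \lim_{N \to \infty} - \eta \frac{\partial}{\partial x} \left\langle \left\langle \left\langle \mathcal{G}[\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi}]\, (\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}'] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}} \right\rangle_{QRP; t}
= - \eta \frac{\partial}{\partial x} \int dx'\, dy'\; \mathcal{G}[x', y']\; C[x, y; x', y'] \tag{3.9}$$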

The difference between the macroscopic equations for batch and on-line
learning is merely the presence (on-line) or absence (batch) of those terms which
are quadratic in the learning rate $\eta$.
Résumé

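We separate the function $C[x, y; x', y']$ (3.6), which plays a role similar to that
of a Green's function, into two terms ($\boldsymbol{\xi} = \boldsymbol{\xi}'$ versus $\boldsymbol{\xi} \neq \boldsymbol{\xi}'$):

$$C[x, y; x', y'] = \alpha^{-1}\, \delta[x - x']\, \delta[y - y']\, P[x, y] + A[x, y; x', y']$$

$$A[x, y; x', y'] = \lim_{N \to \infty} \left\langle \left\langle \left\langle [1 - \delta_{\boldsymbol{\xi}\boldsymbol{\xi}'}]\, \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}]\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}]\, (\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\, \delta[x' - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y' - \mathbf{B}\cdot\boldsymbol{\xi}'] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}} \right\rangle_{QRP; t} \tag{3.10}$$

With this definition our macroscopic laws, which are exact for $N \to \infty$ but
not yet closed due to the appearance of the microscopic probability density
$p_t(\mathbf{J})$ in the sub-shell average of (3.10), can be summarised as follows:

$$\frac{d}{dt} Q = 2\eta \int dx\, dy\; P[x, y]\, x\, \mathcal{G}[x, y] + \kappa \eta^2 \int dx\, dy\; P[x, y]\, \mathcal{G}^2[x, y] \tag{3.11}$$

$$\frac{d}{dt} R = \eta \int dx\, dy\; P[x, y]\, y\, \mathcal{G}[x, y] \tag{3.12}$$

$$\frac{\partial}{\partial t} P[x, y] = - \frac{\eta}{\alpha} \frac{\partial}{\partial x} \left[ \mathcal{G}[x, y]\, P[x, y] \right] - \eta \frac{\partial}{\partial x} \int dx'\, dy'\; \mathcal{G}[x', y']\, A[x, y; x', y']
+ \frac{1}{2} \kappa \eta^2 \int dx'\, dy'\; P[x', y']\, \mathcal{G}^2[x', y']\; \frac{\partial^2}{\partial x^2} P[x, y] \tag{3.13}$$

in which $\kappa = 1$ for on-line learning and $\kappa = 0$ for batch learning. The complexity
of the problem is fully concentrated in the Green's function $A[x, y; x', y']$.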
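
The right-hand sides of (3.11) and (3.12) are plain averages over the joint field distribution, so given a sample of fields $(x, y)$ drawn from $P[x, y]$ they can be estimated directly. The sketch below does this with empirical means; the function name and the idea of passing the rule $\mathcal{G}$ as a callable are assumptions of this illustration, and the equation (3.13) for $P[x, y]$ itself is not treated here.

```python
import numpy as np

def macroscopic_rates(x, y, G, eta, kappa):
    """Estimate dQ/dt (3.11) and dR/dt (3.12) from field samples.

    kappa = 1 for on-line learning, kappa = 0 for batch learning.
    """
    g = G(x, y)
    dQ = 2.0 * eta * np.mean(x * g) + kappa * eta**2 * np.mean(g**2)
    dR = eta * np.mean(y * g)
    return dQ, dR
```

With `kappa=0` the term quadratic in the learning rate drops out, which is the only difference between the batch and on-line estimates.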



3.3 Closure of Macroscopic Dynamical Laws 

We now close our macroscopic laws by making, for $N \to \infty$, the two key
assumptions underlying dynamical replica theories: 

1. Our macroscopic observables {Q, R, P} obey closed dynamic equations. 

2. These macroscopic equations are self-averaging with respect to the dis- 
order, i.e. the microscopic realisation of the training set D. 

Assumption 1 implies that all microscopic probability variations within the
$\{Q, R, P\}$ subshells of the $\mathbf{J}$-ensemble are either absent or irrelevant to the
evolution of $\{Q, R, P\}$. We may consequently make the simplest self-consistent
choice for $p_t(\mathbf{J})$ in evaluating the macroscopic laws, i.e. in (3.10): microscopic
probability equipartitioning in the $\{Q, R, P\}$-subshells of the ensemble, or

$$p_t(\mathbf{J}) \to w(\mathbf{J}) \sim \delta[Q - Q[\mathbf{J}]]\; \delta[R - R[\mathbf{J}]] \prod_{xy} \delta[P[x, y] - P[x, y; \mathbf{J}]] \tag{3.14}$$



This distribution depends on time via the order parameters $\{Q, R, P\}$. Note
that (3.14) leads to exact macroscopic laws if our observables $\{Q, R, P\}$ for
$N \to \infty$ indeed obey closed equations, and is true in equilibrium for
detailed balance models in which the Hamiltonian can be written in terms of
$\{Q, R, P\}$. It is an approximation if our observables do not obey closed
equations. Assumption 2 allows us to average the macroscopic laws over the
disorder; for mean-field models it is usually convincingly supported by numerical
simulations, and can be proven using the path integral formalism (see e.g.
Horner, 1992). We write averages over all training sets $\tilde{D} \subseteq \{-1, 1\}^N$, with
$|\tilde{D}| = p$, as $\langle \ldots \rangle_{\Xi}$. Our assumptions result in the closure of (3.11, 3.12, 3.13),
since now the function $A[x, y; x', y']$ is expressed fully in terms of $\{Q, R, P\}$:

$$A[x, y; x', y'] = \lim_{N \to \infty} \left\langle \frac{\int d\mathbf{J}\; w(\mathbf{J})\; \left\langle \left\langle \delta[x - \mathbf{J}\cdot\boldsymbol{\xi}]\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}]\, (\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\, [1 - \delta_{\boldsymbol{\xi}\boldsymbol{\xi}'}]\, \delta[x' - \mathbf{J}\cdot\boldsymbol{\xi}']\, \delta[y' - \mathbf{B}\cdot\boldsymbol{\xi}'] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}}}{\int d\mathbf{J}\; w(\mathbf{J})} \right\rangle_{\Xi}$$

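The final ingredient of dynamical replica theory is the realization that
averages of fractions can be calculated with the replica identity

$$\left\langle \frac{\int d\mathbf{J}\; W[\mathbf{J}, \mathbf{z}]\; G[\mathbf{J}, \mathbf{z}]}{\int d\mathbf{J}\; W[\mathbf{J}, \mathbf{z}]} \right\rangle_{\mathbf{z}} = \lim_{n \to 0} \int d\mathbf{J}^1 \cdots d\mathbf{J}^n \left\langle G[\mathbf{J}^1, \mathbf{z}] \prod_{\alpha = 1}^{n} W[\mathbf{J}^\alpha, \mathbf{z}] \right\rangle_{\mathbf{z}}$$

giving

$$A[x, y; x', y'] = \lim_{n \to 0} \lim_{N \to \infty} \int \prod_{\alpha = 1}^{n} w(\mathbf{J}^\alpha)\, d\mathbf{J}^\alpha \left\langle \left\langle \left\langle \delta[x - \mathbf{J}^1\cdot\boldsymbol{\xi}]\, \delta[y - \mathbf{B}\cdot\boldsymbol{\xi}]\, (\boldsymbol{\xi}\cdot\boldsymbol{\xi}')\, [1 - \delta_{\boldsymbol{\xi}\boldsymbol{\xi}'}]\, \delta[x' - \mathbf{J}^1\cdot\boldsymbol{\xi}']\, \delta[y' - \mathbf{B}\cdot\boldsymbol{\xi}'] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}} \right\rangle_{\Xi}$$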

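Since each weight component scales as $J_i^\alpha = O(N^{-1/2})$ we transform variables
in such a way that our calculations will involve $O(1)$ objects:

$$(\forall i)(\forall \alpha): \qquad J_i^\alpha = (Q/N)^{1/2}\, \sigma_i^\alpha, \qquad B_i = N^{-1/2}\, \tau_i$$

This ensures $\sigma_i^\alpha = O(1)$, $\tau_i = O(1)$, and reduces various constraints to
ordinary spherical ones: $(\boldsymbol{\sigma}^\alpha)^2 = \boldsymbol{\tau}^2 = N$ for all $\alpha$. Overall prefactors generated
by these transformations always vanish due to $n \to 0$. We find a new effective
measure: $\prod_{\alpha = 1}^{n} w(\mathbf{J}^\alpha)\, d\mathbf{J}^\alpha \to \prod_{\alpha = 1}^{n} w(\boldsymbol{\sigma}^\alpha)\, d\boldsymbol{\sigma}^\alpha$, with

$$w(\boldsymbol{\sigma}) \sim \delta\left[N - \boldsymbol{\sigma}^2\right]\; \delta\left[N R Q^{-1/2} - \boldsymbol{\tau}\cdot\boldsymbol{\sigma}\right] \prod_{xy} \delta\left[P[x, y] - P[x, y; (Q/N)^{1/2} \boldsymbol{\sigma}]\right] \tag{3.15}$$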
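
The change of variables leading to (3.15) is easy to verify numerically. The sketch below assumes a normalized teacher, $\mathbf{B}^2 = 1$, which is what makes $\boldsymbol{\tau}^2 = N$ hold; the function name is hypothetical.

```python
import numpy as np

def rescale_weights(J, B):
    """Map (J, B) to the O(1) variables (sigma, tau) defined by
    J_i = sqrt(Q/N) * sigma_i and B_i = tau_i / sqrt(N)."""
    N = J.size
    Q = J @ J
    sigma = J * np.sqrt(N / Q)
    tau = B * np.sqrt(N)
    return sigma, tau
```

After rescaling, the constraints $Q[\mathbf{J}] = Q$ and $R[\mathbf{J}] = R$ become $\boldsymbol{\sigma}^2 = N$ and $\boldsymbol{\tau}\cdot\boldsymbol{\sigma} = N R / \sqrt{Q}$, the arguments of the first two $\delta$-functions in (3.15).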



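We thus arrive at

$$A[x, y; x', y'] = \lim_{n \to 0} \lim_{N \to \infty} \int \prod_{\alpha = 1}^{n} w(\boldsymbol{\sigma}^\alpha)\, d\boldsymbol{\sigma}^\alpha \left\langle \left\langle \left\langle (\boldsymbol{\xi}'\cdot\boldsymbol{\xi})\, [1 - \delta_{\boldsymbol{\xi}\boldsymbol{\xi}'}]\; \delta\left[x - \frac{\sqrt{Q}\, \boldsymbol{\sigma}^1\cdot\boldsymbol{\xi}}{\sqrt{N}}\right] \delta\left[y - \frac{\boldsymbol{\tau}\cdot\boldsymbol{\xi}}{\sqrt{N}}\right] \delta\left[x' - \frac{\sqrt{Q}\, \boldsymbol{\sigma}^1\cdot\boldsymbol{\xi}'}{\sqrt{N}}\right] \delta\left[y' - \frac{\boldsymbol{\tau}\cdot\boldsymbol{\xi}'}{\sqrt{N}}\right] \right\rangle_{\tilde{D}} \right\rangle_{\tilde{D}} \right\rangle_{\Xi} \tag{3.16}$$

In the same fashion one can also express $P[x, y]$ in replica form (which will
prove useful for normalization purposes and for self-consistency tests):

$$P[x, y] = \lim_{n \to 0} \lim_{N \to \infty} \int \prod_{\alpha = 1}^{n} w(\boldsymbol{\sigma}^\alpha)\, d\boldsymbol{\sigma}^\alpha \left\langle \left\langle \delta\left[x - \frac{\sqrt{Q}\, \boldsymbol{\sigma}^1\cdot\boldsymbol{\xi}}{\sqrt{N}}\right] \delta\left[y - \frac{\boldsymbol{\tau}\cdot\boldsymbol{\xi}}{\sqrt{N}}\right] \right\rangle_{\tilde{D}} \right\rangle_{\Xi} \tag{3.17}$$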



4 Replica Calculation of the Green's Function 



4.1 Disorder Averaging 

In order to perform the disorder average we insert integral representations
for the $\delta$-functions which define the fields $(x, y, x', y')$ and for the $\delta$-functions
in the measure (3.15) which involve $P[x, y]$, generating $n$ conjugate order
parameter functions $\hat{P}_\alpha(x, y)$. Upon also writing averages over the training
set in terms of the $p$ constituent vectors $\{\boldsymbol{\xi}^\mu\}$ we obtain for (3.16) and (3.17):



A[x,y;x,y] = 

, n ( 

H\d(T a 5 



dx dx'dy dy' 
(2vr) 4 



„ n 

e i [xi+x 'x'+yy+yy'] Um / TT TT dP a (x",y") 



N-(tr 



a\2 



a=l 



NR 



Q 



- T ■ <T 



a=l x"y" 
e iN J dx"dy" P a (x",y") P t (x",y") 



l_ ...... -tEa^^^^.^-^ri*^ 1 ^]-^!*'^ 1 ^ 



(4.1) 



n i d(ra5 



a=l 



N-(tr 



a\2 



NR 



Q 



- T • <T 



D iN J dx"dy" P a (x",y") Pt(x",y") 



'1' -iE^-^^-^W^ 



(4.2) 



In calculating the averages over the training sets $\langle \ldots \rangle_{\Xi}$ that occur in (4.1) and
(4.2) one can use permutation symmetries with respect to sites and pattern
labels, leading to the following compact results:



oP log© [0,0] I y-' W^jfflj Vj 



N 



£> 2 [0,0] 



+ C(iV~5) 



(4.3) 
and



p JU=1 / 3 
= pio g x>[o,o]PF>y] 



[o,o] v ; 



in which 



T>[u,v] 
Sj[u,v] 



(4.4) 



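and with the abbreviation $\langle f[\boldsymbol{\xi}] \rangle_{\boldsymbol{\xi}} = 2^{-N} \sum_{\boldsymbol{\xi} \in \{-1, 1\}^N} f[\boldsymbol{\xi}]$. These quantities
(which are both $O(1)$ for $N \to \infty$) are, in turn, evaluated by using the
central limit theorem, which ensures that for $N \to \infty$ the $n$ rescaled inner
products $\boldsymbol{\sigma}^\alpha\cdot\boldsymbol{\xi}/\sqrt{N}$ and the rescaled inner product $\boldsymbol{\tau}\cdot\boldsymbol{\xi}/\sqrt{N}$ will become
(correlated) zero-average Gaussian variables. After some algebra one finds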

C[u, v; u, v'\ = E S A U > v \ £ A u 'i v '\ = 
j 

a/3 la J la 

-RY] \-J^[u,v] + uS a iD[u,v\] \-F£[u',i/] +v' 5biV[u',v'\ 
a0 la J La 



1 



-RJ2 -FiW, v'} + u 1 S al V[u', v' 



a(3 



.a 



1 



JF^[m, v] + v 5piV[u, v] 



a 



E 

a/3 



a 



-F%[u, v]+v 8 al V[u, v] 



a 



-T2 [vf, v'\ + V 5/3iV[u, v'\ 



0{N~ 



(4.5) 



in which $\mathcal{D}[u, v]$ and the $\mathcal{F}^\lambda_\alpha[u, v]$ are given by $n+1$ dimensional integrals:



i . _i 



x \ . x 



fdx dy det*A M r l 

V ^=J (27r) (n + l)/ 2 6 W V " 



i ^ a P a (VQx a ,y)-i[u,/Qxi+vy] 



dx dy det 5 A 

(27r)( n+1 )/ 2 



d\P a (jQx a ,y) e 



X \ A X 



y \y 



(4.6) 



^Ea A»(-^X a ,J/)-i[uv^Xl+UJ/] 



(4.7) 
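with $\lambda \in \{1, 2\}$. The matrix $A$ in (4.6, 4.7) is defined by

$$A^{-1} = \begin{pmatrix} q_{11} & \cdots & q_{1n} & R/\sqrt{Q} \\ \vdots & & \vdots & \vdots \\ q_{n1} & \cdots & q_{nn} & R/\sqrt{Q} \\ R/\sqrt{Q} & \cdots & R/\sqrt{Q} & 1 \end{pmatrix}, \qquad q_{\alpha\beta}(\{\boldsymbol{\sigma}\}) = \frac{1}{N} \sum_i \sigma_i^\alpha \sigma_i^\beta \tag{4.8}$$

Note that the quantities (4.6, 4.7) depend on the microscopic variables $\boldsymbol{\sigma}^\alpha$
only through the spin-glass order parameters $q_{\alpha\beta}(\{\boldsymbol{\sigma}\})$.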
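
The matrix in (4.8) can be assembled explicitly from a set of replicated weight vectors. The sketch below takes the $n$ rescaled vectors $\boldsymbol{\sigma}^\alpha$ as rows of an array and fills in the overlaps $q_{\alpha\beta}$ together with the teacher row and column; the function name and the array layout are assumptions for illustration.

```python
import numpy as np

def inverse_A(sigmas, R, Q):
    """Build the (n+1) x (n+1) matrix A^{-1} of (4.8).

    sigmas: array of shape (n, N), one rescaled replica per row.
    """
    n, N = sigmas.shape
    M = np.empty((n + 1, n + 1))
    M[:n, :n] = sigmas @ sigmas.T / N      # q_{alpha beta}
    M[:n, n] = R / np.sqrt(Q)              # student-teacher entries
    M[n, :n] = R / np.sqrt(Q)
    M[n, n] = 1.0                          # teacher-teacher entry
    return M
```

When each replica satisfies the spherical constraint $(\boldsymbol{\sigma}^\alpha)^2 = N$, the diagonal entries $q_{\alpha\alpha}$ equal one.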



4.2 Derivation of Saddle-Point Equations 



We combine the results (4.3, 4.4, 4.5) with (4.1, 4.2). We use integral
representations for the remaining $\delta$-functions, and isolate the $q_{\alpha\beta}$, by inserting

dq dq dQ dR iN[J2jQ«+R a R/V<3)+Y, aS «<*0<i<x0} 
{27r/N) n2 + 2n 



We hereby achieve a full factorisation over sites, and both (4.1) and (4.2) can
be written in the form of an integral dominated by saddle-points:

A[x, y; x', y'] = f - f ^ ^ e ^+^'+«] 



lim lim [dq dq dQ dR J[ dP a (x",y") e ^B40to £ W«1 

P[x,y) = f^3^ +v " ] 

hm lim [dq dq dQ dR TT dP a (x",y") ^[9,9,0,^,^}] ^Ml 

n^ON^ooJ y J-l ° l ' P[ ,0] 



with 



*[...] = i J2(Q a + R a R/ \[q) + «EW?^ + *E / dx d y ^( x > y) p i x i v\ 

+alog£>[0,0l + lim — Vlog fdtr e^EJQ^+^^HE^^ 

The above expressions for $A[x, y; x', y']$ and $P[x, y]$ will be given by the
intensive parts of the integrands, evaluated in the dominating saddle-point of $\Psi$.
We can use the equation for $P[x, y]$ to verify that all expressions are properly
normalized. After a simple transformation of some integration variables,



Qa/3 — > Qa/3 ~ Ra ~ > V QR a 
we arrive at the simple result

'dx dx' dy 



A[x,y;x',y'] 
P[x,y] 



_ e i[xx+x'x'+yy+yy'} jj m 



C[x,y;x',y r 



dx <l<J ; ,,:r., w V[*,y] 

(2tt) 2 «-oD[0,0] 



o X? 2 [0,0] 



(4.9) 
(4.10) 



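in which all functions are to be evaluated upon choosing for the order parameters the appropriate saddle-point of $\Psi$, which itself takes the form:

$$\begin{aligned}\Psi[\ldots]=\;& i\sum_{\alpha}\hat Q_{\alpha}(1-q_{\alpha\alpha})+iR\sum_{\alpha}\hat R_{\alpha}+i\sum_{\alpha\beta}\hat q_{\alpha\beta}q_{\alpha\beta}+i\sum_{\alpha}\int\!dx\,dy\;\hat P_{\alpha}(x,y)P[x,y]\\ &+\alpha\log\mathcal{D}[0,0]+\lim_{N\to\infty}\frac1N\sum_i\log\int\!d\boldsymbol{\sigma}\; e^{-i\tau_i\sqrt{Q}\sum_{\alpha}\hat R_{\alpha}\sigma_{\alpha}-i\sum_{\alpha\beta}\hat q_{\alpha\beta}\sigma_{\alpha}\sigma_{\beta}}\end{aligned} \qquad (4.11)$$

with $\mathcal{D}[u,v]$ given by (4.6) and with the function $\mathcal{C}[u,v;u',v']$ given by (4.5).
The auxiliary order parameters $q_{\alpha\beta}$ have the usual interpretation in terms of
the average probability density for finding a mutual overlap $q$ of two independently
evolving weight vectors with the same realization of the training set
(see e.g. Mézard et al, 1987):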



{P(q))s 



5 



J a -J b 



q- 



\J a \\J b \ 

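We now make the replica symmetric (RS) ansatz in the extremisation
problem, which according to (4.12) is equivalent to assuming ergodicity. With
a modest amount of foresight we put

$$q_{\alpha\beta}=q_0\delta_{\alpha\beta}+q[1-\delta_{\alpha\beta}], \qquad \hat q_{\alpha\beta}=\tfrac{i}{2}[r-r_0\delta_{\alpha\beta}]$$

$$\hat R_{\alpha}=i\rho, \qquad \hat Q_{\alpha}=i\phi, \qquad \hat P_{\alpha}(u,v)=i\chi[u,v]$$

This allows us to expand the quantity $\Psi$ of (4.11) for small $n$:

$$\begin{aligned}\lim_{n\to0}\frac1n\Psi[\ldots]=\;&-\phi(1-q_0)-\rho R+\tfrac12qr-\tfrac12q_0(r-r_0)-\tfrac12\log r_0+\frac{1}{2r_0}(r+\rho^2Q)\\ &-\int\!dx\,dy\;\chi[x,y]P[x,y]+\lim_{n\to0}\frac{\alpha}{n}\log\mathcal{D}[0,0]+\text{constants}\end{aligned}$$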

At this stage it is useful to work out those saddle-point equations that follow
upon variation of $\{\phi,r,\rho,r_0\}$:

$$q_0=1, \qquad r_0=\frac{1}{1-q}, \qquad \rho=\frac{R}{Q(1-q)}, \qquad r=\frac{qQ-R^2}{Q(1-q)^2}$$
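The stationarity conditions above can be checked numerically. The following is a minimal sketch, assuming that the part of the small-$n$ expansion depending on $\{\phi,r,\rho,r_0\}$ reads $-\phi(1-q_0)-\rho R+\frac12qr-\frac12q_0(r-r_0)-\frac12\log r_0+(r+\rho^2Q)/2r_0$; the function names `psi_rs`, `saddle_point` and `gradient` are illustrative and not part of the original text.

```python
import math

def psi_rs(phi, r, rho, r0, q0, q, Q, R):
    """Assumed RS form of the (phi, r, rho, r0)-dependent part of lim (1/n) Psi."""
    return (-phi * (1.0 - q0) - rho * R + 0.5 * q * r
            - 0.5 * q0 * (r - r0) - 0.5 * math.log(r0)
            + (r + rho ** 2 * Q) / (2.0 * r0))

def saddle_point(q, Q, R):
    """Closed-form stationary values quoted in the text (with q0 = 1)."""
    r0 = 1.0 / (1.0 - q)
    rho = R / (Q * (1.0 - q))
    r = (q * Q - R ** 2) / (Q * (1.0 - q) ** 2)
    return r, rho, r0

def gradient(q, Q, R, h=1e-6):
    """Central finite differences of psi_rs w.r.t. (r, rho, r0) at the saddle point."""
    base = list(saddle_point(q, Q, R))
    grads = []
    for k in range(3):
        up, dn = list(base), list(base)
        up[k] += h
        dn[k] -= h
        f_up = psi_rs(0.0, up[0], up[1], up[2], 1.0, q, Q, R)
        f_dn = psi_rs(0.0, dn[0], dn[1], dn[2], 1.0, q, Q, R)
        grads.append((f_up - f_dn) / (2.0 * h))
    return grads
```

Calling `gradient(q, Q, R)` for any admissible triple (with $qQ>R^2$ and $q<1$) should return three values that vanish up to finite-difference error; variation with respect to $\phi$ is trivial and simply enforces $q_0=1$.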
These allow us to eliminate most variational parameters, leaving a saddle-point
problem involving only the function $\chi[x,y]$ and the scalar $q$:

$$\lim_{n\to0}\frac1n\Psi[q,\{\chi\}]=\frac{1-R^2/Q}{2(1-q)}+\frac12\log(1-q)-\int\!dx\,dy\;\chi[x,y]P[x,y]+\lim_{n\to0}\frac{\alpha}{n}\log\mathcal{D}[0,0;q,\{\chi\}]+\text{constants} \qquad (4.13)$$



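Finally we have to work out the RS version of $\mathcal{D}[0,0;q,\{\chi\}]$, as defined more
generally in (4.6). The inverse of the matrix in (4.8), in RS ansatz, is found
to be:

$$A=\begin{pmatrix} C_{11} & \cdots & C_{1n} & \gamma\\ \vdots & & \vdots & \vdots\\ C_{n1} & \cdots & C_{nn} & \gamma\\ \gamma & \cdots & \gamma & b\end{pmatrix}, \qquad \begin{aligned} C_{ab}&=\frac{\delta_{ab}}{1-q}-d+\mathcal{O}(n), & d&=\frac{q-R^2/Q}{(1-q)^2}\\ \gamma&=-\frac{R/\sqrt{Q}}{1-q}+\mathcal{O}(n), & b&=1+\mathcal{O}(n)\end{aligned} \qquad (4.14)$$

With this expression we obtain

$$\mathcal{D}[0,0;q,\{\chi\}]=\frac{\int\!d\mathbf{x}\,dy\; e^{-\frac12\mathbf{x}\cdot C\mathbf{x}-\frac12by^2-\gamma y\sum_ax_a+\frac1\alpha\sum_a\chi(\sqrt{Q}x_a,y)}}{\int\!d\mathbf{x}\,dy\; e^{-\frac12\mathbf{x}\cdot C\mathbf{x}-\frac12by^2-\gamma y\sum_ax_a}}=\frac{\int\!DzDy\left[\int\!dx\; e^{-\frac{x^2}{2(1-q)}+x[z\sqrt{d}-\gamma y]+\frac1\alpha\chi(\sqrt{Q}x,y)}\right]^n}{\int\!DzDy\left[\int\!dx\; e^{-\frac{x^2}{2(1-q)}+x[z\sqrt{d}-\gamma y]}\right]^n}$$

$$\lim_{n\to0}\frac{\alpha}{n}\log\mathcal{D}[0,0;q,\{\chi\}]=\alpha\int\!DzDy\,\log\left\{\frac{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[z\sqrt{d}-\gamma y]/\sqrt{Q}+\frac1\alpha\chi[x,y]}}{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[z\sqrt{d}-\gamma y]/\sqrt{Q}}}\right\}$$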



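We can simplify this result by defining

$$A=\frac{R}{Q(1-q)}, \qquad B=\frac{\sqrt{qQ-R^2}}{Q(1-q)} \qquad (4.15)$$

which gives

$$\lim_{n\to0}\frac{\alpha}{n}\log\mathcal{D}[0,0;q,\{\chi\}]=\alpha\int\!DzDy\,\log\left\{\frac{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}}{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]}}\right\}$$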

Upon carrying out the $x$-integration in the denominator of this expression
we can now write (4.13) in a surprisingly simple form (with the short-hand
(4.15)):

$$\begin{aligned}\lim_{n\to0}\frac1n\Psi[q,\{\chi\}]=\;&\frac{1-\alpha-R^2/Q}{2(1-q)}+\frac12(1-\alpha)\log(1-q)-\int\!dx\,dy\;\chi[x,y]P[x,y]\\ &+\alpha\int\!DzDy\,\log\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}\end{aligned} \qquad (4.16)$$

Note that (4.16) is to be minimised, both with respect to $q$ (which originated
as an $n(n-1)$-fold entry in a matrix, leading to curvature sign change for
$n<1$) and with respect to $\chi[x,y]$ (obtained from the $n$-fold occurrence of the
function $\hat P[x,y]$, multiplied by $i$, which also leads to a curvature sign change).

The remaining saddle point equations correspond to the (functional) variation
with respect to $\chi$:

$$\text{for all } x,y:\qquad P[x,y]=\frac{e^{-\frac12y^2}}{\sqrt{2\pi}}\int\!Dz\;\frac{e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}}{\int\!dx'\; e^{-\frac{x'^2}{2Q(1-q)}+x'[Ay+Bz]+\frac1\alpha\chi[x',y]}} \qquad (4.17)$$

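and $q$ (using equation (4.17) wherever possible):

$$\int\!dx\,dy\;P[x,y]\,x^2-2R\int\!dx\,dy\;P[x,y]\,xy-qQ(\alpha^{-1}-1)+R^2\alpha^{-1}=\left[2\sqrt{qQ-R^2}+\frac{Q(1-q)}{\sqrt{qQ-R^2}}\right]\int\!DzDy\; z\;\frac{\int\!dx\;x\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}}{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}} \qquad (4.18)$$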



4.3 Explicit Expression for the Green's Function 



In order to work out the Green's function (4.9) we need $\mathcal{C}[u,v;u',v']$ as defined
in (4.5) which, in turn, is given in terms of the integrals (4.6, 4.7). First we
calculate in RS ansatz the $n\to0$ limit of $\mathcal{D}[u,v;q,\{\chi\}]$ (4.6), using (4.14),
and simplify the result with the saddle-point equation (4.17):

$$\lim_{n\to0}\mathcal{D}[u,v;q,\{\chi\}]=\int\!DzDy\; e^{-ivy}\;\frac{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]-iux}}{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}}=\int\!dx\,dy\;P[x,y]\; e^{-ivy-iux} \qquad (4.19)$$

Next we work out $\mathcal{F}^{\alpha}_{\lambda}[u,v]$ (4.7) in RS ansatz, using (4.14), with $\lambda\in\{1,2\}$:



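$$\lim_{n\to0}\mathcal{F}^{\alpha}_{\lambda}[u,v]=i\lim_{n\to0}\int\!DyDz\; e^{-ivy}\int\prod_a\left[dx_a\; e^{-\frac{x_a^2}{2(1-q)}+[z\sqrt{d}-\gamma y]x_a+\frac1\alpha\chi[\sqrt{Q}x_a,y]}\right] e^{-iu\sqrt{Q}x_1}\;\partial_{\lambda}\chi[\sqrt{Q}x_{\alpha},y]$$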



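Replica permutation symmetries allow us to simplify this expression:

$$\lim_{n\to0}\mathcal{F}^{\alpha}_{\lambda}[u,v]=\delta_{\alpha1}\mathcal{F}^{1}_{\lambda}[u,v]+(1-\delta_{\alpha1})\mathcal{F}^{2}_{\lambda}[u,v] \qquad (4.20)$$

with

$$\mathcal{F}^{1}_{\lambda}[u,v]=i\int\!dx\,dy\;P[x,y]\; e^{-ivy-iux}\;\partial_{\lambda}\chi[x,y] \qquad (4.21)$$

and

$$\mathcal{F}^{2}_{\lambda}[u,v]=i\int\!DyDz\; e^{-ivy}\;\frac{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]-iux}}{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}}\;\frac{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}\;\partial_{\lambda}\chi[x,y]}{\int\!dx\; e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}} \qquad (4.22)$$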

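We can now proceed with the calculation of (4.5), whose building blocks are

$$\alpha^{-1}\mathcal{F}^{\alpha}_{1}[u,v]+u\,\delta_{\alpha1}\mathcal{D}[u,v]=\delta_{\alpha1}G^{1}_{1}[u,v]+(1-\delta_{\alpha1})G^{2}_{1}[u,v]$$

$$\alpha^{-1}\mathcal{F}^{\alpha}_{2}[u,v]+v\,\delta_{\alpha1}\mathcal{D}[u,v]=\delta_{\alpha1}G^{1}_{2}[u,v]+(1-\delta_{\alpha1})G^{2}_{2}[u,v]$$

with

$$G^{1}_{1}[u,v]=\alpha^{-1}\mathcal{F}^{1}_{1}[u,v]+u\mathcal{D}[u,v], \qquad G^{2}_{1}[u,v]=\alpha^{-1}\mathcal{F}^{2}_{1}[u,v]$$

$$G^{1}_{2}[u,v]=\alpha^{-1}\mathcal{F}^{1}_{2}[u,v]+v\mathcal{D}[u,v], \qquad G^{2}_{2}[u,v]=\alpha^{-1}\mathcal{F}^{2}_{2}[u,v]$$

and their Fourier transforms (for $k,\lambda\in\{1,2\}$):

$$\hat G^{k}_{\lambda}[\hat u,\hat v]=\int\!\frac{du\,dv}{(2\pi)^2}\; e^{iu\hat u+iv\hat v}\;G^{k}_{\lambda}[u,v]$$

With these short-hands we obtain a compact expression for (4.5), and can
subsequently write our expression (4.9) for the Green's function $A[x,y;x',y']$ as

$$\begin{aligned}A[x,y;x',y']=\;&-Q(1-q)\left[\hat G^{1}_{1}[x,y]\hat G^{1}_{1}[x',y']-\hat G^{2}_{1}[x,y]\hat G^{2}_{1}[x',y']\right]\\ &-Qq\left[\hat G^{1}_{1}[x,y]-\hat G^{2}_{1}[x,y]\right]\left[\hat G^{1}_{1}[x',y']-\hat G^{2}_{1}[x',y']\right]\\ &-R\left[\hat G^{1}_{1}[x,y]-\hat G^{2}_{1}[x,y]\right]\left[\hat G^{1}_{2}[x',y']-\hat G^{2}_{2}[x',y']\right]\\ &-R\left[\hat G^{1}_{1}[x',y']-\hat G^{2}_{1}[x',y']\right]\left[\hat G^{1}_{2}[x,y]-\hat G^{2}_{2}[x,y]\right]\\ &-\left[\hat G^{1}_{2}[x,y]-\hat G^{2}_{2}[x,y]\right]\left[\hat G^{1}_{2}[x',y']-\hat G^{2}_{2}[x',y']\right]\end{aligned} \qquad (4.23)$$

Finally, working out the four relevant Fourier transforms, using (4.19, 4.21, 4.22),
gives:

$$\hat G^{1}_{1}[x,y]=i\left[\frac1\alpha P[x,y]\frac{\partial}{\partial x}\chi[x,y]-\frac{\partial}{\partial x}P[x,y]\right], \qquad \hat G^{1}_{2}[x,y]=i\left[\frac1\alpha P[x,y]\frac{\partial}{\partial y}\chi[x,y]-\frac{\partial}{\partial y}P[x,y]\right]$$

$$\hat G^{2}_{\lambda}[x,y]=\frac{i}{\alpha}\frac{e^{-\frac12y^2}}{\sqrt{2\pi}}\int\!Dz\;\frac{\int\!dx'\; e^{-\frac{x'^2}{2Q(1-q)}+x'[Ay+Bz]+\frac1\alpha\chi[x',y]}\;\partial_{\lambda}\chi[x',y]}{\int\!dx'\; e^{-\frac{x'^2}{2Q(1-q)}+x'[Ay+Bz]+\frac1\alpha\chi[x',y]}}\;\frac{e^{-\frac{x^2}{2Q(1-q)}+x[Ay+Bz]+\frac1\alpha\chi[x,y]}}{\int\!dx'\; e^{-\frac{x'^2}{2Q(1-q)}+x'[Ay+Bz]+\frac1\alpha\chi[x',y]}}, \qquad \lambda\in\{1,2\}$$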



4.4 Summary 

At this stage it is advantageous to summarize the theory and choose the most
transparent representation of our equations. We first replace the function $\chi[x,y]$
by the effective measure $M[x,y]$:

$$M[x,y]=e^{-\frac{x^2}{2Q(1-q)}+Axy+\frac1\alpha\chi[x,y]}$$

We introduce a compact notation for the various averages we encounter:

$$\langle f[x,y,z]\rangle_{\star}=\frac{\int\!dx\;M[x,y]\,e^{Bxz}f[x,y,z]}{\int\!dx\;M[x,y]\,e^{Bxz}} \qquad (4.24)$$

with the short-hand

$$B=\frac{\sqrt{qQ-R^2}}{Q(1-q)}$$

From equation (4.17) we deduce that the function $P[x,y]$ always obeys

$$P[x,y]=P[x|y]\,P[y], \qquad P[y]=(2\pi)^{-\frac12}e^{-\frac12y^2} \qquad (4.25)$$

This enables us to write our results in terms of $P[x|y]$ rather than $P[x,y]$.

We also introduce the transformed Green's function $\tilde A$ via $A[x,y;x',y']=P[x,y]\,\tilde A[x,y;x',y']\,P[x',y']$. In combination these simplifications allow us to
summarize our theory as follows. The macroscopic dynamic equations are:

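$$\frac{d}{dt}Q=2\eta\int\!dxDy\;P[x|y]\;x\,\mathcal{G}[x,y]+L, \qquad \frac{d}{dt}R=\eta\int\!dxDy\;P[x|y]\;y\,\mathcal{G}[x,y] \qquad (4.26)$$

$$\frac{d}{dt}P[x|y]=-\eta\frac{\partial}{\partial x}\left\{P[x|y]\left[\frac1\alpha\mathcal{G}[x,y]+\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,\tilde A[x,y;x',y']\right]\right\}+\frac12L\frac{\partial^2}{\partial x^2}P[x|y] \qquad (4.27)$$

with

$$\text{batch}:\quad L=0, \qquad\qquad \text{on-line}:\quad L=\eta^2\int\!dxDy\;P[x|y]\,\mathcal{G}^2[x,y] \qquad (4.28)$$

The solution of (4.26, 4.27) subsequently generates the training and generalization
errors (2.5, 2.6) at any time:

$$E_{\rm t}=\int\!dxDy\;P[x|y]\;\theta[-xy], \qquad E_{\rm g}=\frac1\pi\arccos[R/\sqrt{Q}] \qquad (4.29)$$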



In order to determine the Green's function $\tilde A[x,y;x',y']$ in equation (4.27)
one first has to calculate the auxiliary order parameters $q$ and $\{M[x,y]\}$ by
solving the following two saddle point equations:

$$\text{for all } X,y:\qquad P[X|y]=\int\!Dz\;\langle\delta[X-x]\rangle_{\star} \qquad (4.30)$$

$$\int\!dxDy\;P[x|y]\,(x^2-2Rxy)+qQ+\frac{R^2-qQ}{\alpha}=\left[2\sqrt{qQ-R^2}+\frac1B\right]\int\!DyDz\;z\,\langle x\rangle_{\star} \qquad (4.31)$$

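The Green's function in equation (4.27) is then given by

$$\begin{aligned}\tilde A[x,y;x',y']=\;&Q(1-q)\left[J_1[x,y]J_1[x',y']-\tilde J_1[x,y]\tilde J_1[x',y']\right]\\ &+Qq\left[J_1[x,y]-\tilde J_1[x,y]\right]\left[J_1[x',y']-\tilde J_1[x',y']\right]+J_2[x,y]J_2[x',y']\\ &+R\left[J_1[x,y]-\tilde J_1[x,y]\right]J_2[x',y']+R\left[J_1[x',y']-\tilde J_1[x',y']\right]J_2[x,y]\end{aligned} \qquad (4.32)$$

with the functions

$$J_1[X,Y]=\frac{\partial}{\partial X}\log\frac{M[X,Y]}{P[X|Y]}+\frac{X-RY}{Q(1-q)} \qquad (4.33)$$

$$\tilde J_1[X,Y]=P[X|Y]^{-1}\int\!Dz\;\left\langle\frac{\partial}{\partial x}\log M[x,Y]+\frac{x-RY}{Q(1-q)}\right\rangle_{\star}\langle\delta[X-x]\rangle_{\star} \qquad (4.34)$$

$$J_2[X,Y]=\frac{\partial}{\partial Y}\log\frac{M[X,Y]}{P[X|Y]}-\frac{RX}{Q(1-q)}+Y-P[X|Y]^{-1}\int\!Dz\;\left\langle\frac{\partial}{\partial Y}\log M[x,Y]-\frac{Rx}{Q(1-q)}\right\rangle_{\star}\langle\delta[X-x]\rangle_{\star} \qquad (4.35)$$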

We finally work out a simple inequality to determine for which regime of q- 
values one should inspect the saddle-point equations. From ( |4.12|) it follows 
that q = (±Ei(vi) 2 )5, giving 



N 



q ~ R 2 /Q 
5 Applications and Tests of the Theory



5.1 The Limit $\alpha\to\infty$



The very least we should require of our theory is that it reduces to the simple
$(Q,R)$ formalism of infinite training sets in the limit $\alpha\to\infty$. This indeed
happens. For $\alpha\to\infty$ our macroscopic equations (4.26, 4.27) reduce to

$$\frac{d}{dt}Q=2\eta\int\!dxDy\;P[x|y]\;x\,\mathcal{G}[x,y]+L, \qquad \frac{d}{dt}R=\eta\int\!dxDy\;P[x|y]\;y\,\mathcal{G}[x,y] \qquad (5.1)$$

$$\frac{d}{dt}P[x|y]=-\eta\frac{\partial}{\partial x}\left\{P[x|y]\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,\tilde A[x,y;x',y']\right\}+\frac12L\frac{\partial^2}{\partial x^2}P[x|y] \qquad (5.2)$$

with the Green's function $\tilde A[x,y;x',y']$ (4.32), and with $L$ as given in (4.28).

The saddle-point equations from which to solve $\{M[x,y]\}$ and $q$ are:

$$P[X|y]=\int\!Dz\;\langle\delta[X-x]\rangle_{\star}$$

$$qQ+\int\!dxDy\;P[x|y]\,(x^2-2Rxy)=\left[2\sqrt{qQ-R^2}+B^{-1}\right]\int\!DyDz\;z\,\langle x\rangle_{\star}$$

with the convention (4.24), and with the short-hand $B=\sqrt{qQ-R^2}/Q(1-q)$.
We now make the following ansatz:

$$P[x|y]=[2\pi(Q-R^2)]^{-\frac12}\; e^{-\frac12[x-Ry]^2/(Q-R^2)} \qquad (5.3)$$

and find that the two saddle-point equations are simultaneously solved by

$$M[x,y]=[2\pi Q(1-q)]^{-\frac12}\; e^{-\frac12[x-Ry]^2/Q(1-q)} \qquad (5.4)$$



(we will return to the question of uniqueness later). The values of $Q$ and $R$ at
any time are fully specified by (5.1); what remains is to verify the validity of
(5.2). Given the measure (5.4) one can calculate the three functions $J_1[x,y]$,
$\tilde J_1[x,y]$ and $J_2[x,y]$, and thus the Green's function (4.32), explicitly:

$$J_1[x,y]=\frac{x-Ry}{Q-R^2}, \qquad \tilde J_1[x,y]=0, \qquad J_2[x,y]=\frac{Qy-Rx}{Q-R^2}$$

$$\tilde A[x,y;x',y']=\frac{(x-Ry)x'+(Qy-Rx)y'}{Q-R^2} \qquad (5.5)$$
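The step from the three functions to the closed form (5.5) is pure algebra and can be spot-checked numerically. The sketch below assumes that in this limit $\tilde J_1=0$, so that (4.32) collapses to $QJ_1J_1'+J_2J_2'+R[J_1J_2'+J_1'J_2]$; the function names are illustrative only.

```python
def green_from_J(x, y, xp, yp, Q, R):
    """Green's function assembled from J_1 and J_2, assuming tilde-J_1 = 0."""
    D = Q - R * R
    J1 = lambda u, v: (u - R * v) / D
    J2 = lambda u, v: (Q * v - R * u) / D
    return (Q * J1(x, y) * J1(xp, yp) + J2(x, y) * J2(xp, yp)
            + R * J1(x, y) * J2(xp, yp) + R * J1(xp, yp) * J2(x, y))

def green_closed(x, y, xp, yp, Q, R):
    """Closed form [(x - R y) x' + (Q y - R x) y'] / (Q - R^2)."""
    return ((x - R * y) * xp + (Q * y - R * x) * yp) / (Q - R * R)
```

Both functions should agree to rounding error for any arguments with $Q>R^2$; note that $q$ drops out entirely, because the $Q(1-q)$ and $Qq$ terms combine into a single factor $Q$.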

This, in combination with the equations (5.1) for $Q$ and $R$, leads to an explicit
expression for the diffusion equation

$$\frac{d}{dt}P[x|y]=\frac{\partial}{\partial x}\left\{\frac{L}{2}\frac{\partial}{\partial x}P[x|y]-P[x|y]\;\frac{\frac12(x-Ry)\left[\frac{d}{dt}Q-L\right]+(Qy-Rx)\frac{d}{dt}R}{Q-R^2}\right\}$$

Insertion of our ansatz (5.3) into both sides of this equation, followed by
some rearranging of terms, shows that it is indeed satisfied. This confirms
that from our general theory we indeed recover for $\alpha\to\infty$ the standard
theory for infinite training sets, i.e. the closed set (5.1, 5.3), as claimed.
5.2 Locally Gaussian Solutions



As soon as $\alpha$ is finite the diffusion equation (4.27) will no longer have Gaussian
solutions. However, we will show in this section that for a specific simple class
of learning rules one can find solutions described by a conditional distribution
$P[x|y]$ which for each $y$ is Gaussian in $x$, but with moments which are non-trivial
functions of $y$ (such that the full distribution $P[x,y]$ is not Gaussian):

$$\mathcal{G}[x,y]=\mathcal{G}_0[y]+x\,\mathcal{G}_1[y]:\qquad P[x|y]=\frac{e^{-\frac12[x-\bar x(y)]^2/\Delta^2(y)}}{\Delta(y)\sqrt{2\pi}} \qquad (5.6)$$



We choose the distribution $P[x|y]$ in (5.6) as our ansatz. Insertion into the
saddle-point equation (4.30) and integration over $z$ shows that (4.30) is solved
by the following measure (the issue of uniqueness will be discussed later):

$$M[x,y]=\frac{e^{-\frac12[x-\bar x(y)]^2/\sigma^2(y)}}{\sigma(y)\sqrt{2\pi}} \qquad (5.7)$$

with the usual short-hand $B=\sqrt{qQ-R^2}/Q(1-q)$, and with

$$\sigma^2(y)=\frac{1}{2B^2}\left[\sqrt{1+4B^2\Delta^2(y)}-1\right] \qquad (5.8)$$

The simple form of (5.7) allows us to proceed analytically, using identities
such as

$$\langle x\rangle_{\star}=\bar x(y)+Bz\,\sigma^2(y), \qquad \langle[x-\bar x(y)]^2\rangle_{\star}=\sigma^2(y)+B^2z^2\sigma^4(y)$$
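The relation between the two widths can be made concrete with a short sketch. It assumes, as follows from integrating the second identity over $z$, that the conditional variance obeys $\Delta^2=\sigma^2+B^2\sigma^4$, of which (5.8) is the positive root; the helper names are illustrative.

```python
import math

def B_coeff(q, Q, R):
    """Short-hand B = sqrt(qQ - R^2) / [Q (1 - q)]."""
    return math.sqrt(q * Q - R * R) / (Q * (1.0 - q))

def sigma2(Delta2, B):
    """Width of the measure M for a given conditional variance Delta^2, eq. (5.8)."""
    if B == 0.0:
        return Delta2  # limit B -> 0 of (5.8)
    return (math.sqrt(1.0 + 4.0 * B * B * Delta2) - 1.0) / (2.0 * B * B)

def star_mean(xbar, s2, B, z):
    """Average <x>_* = xbar + B z sigma^2 for the Gaussian measure (5.7)."""
    return xbar + B * z * s2
```

For any $\Delta^2>0$, `sigma2` returns a value $\sigma^2\le\Delta^2$, since part of the variance of $P[x|y]$ is generated by the $z$-dependent shift of the mean.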



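The remaining saddle point equation (4.31) for $q$ can be simplified to

$$\int\!Dy\left[\Delta^2(y)+\bar x^2(y)-2Ry\,\bar x(y)\right]-qQ\left(\frac1\alpha-1\right)+\frac{R^2}{\alpha}=\left[1+\frac{2(qQ-R^2)}{Q(1-q)}\right]\int\!Dy\;\sigma^2(y) \qquad (5.9)$$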

Next we have to show that our ansatz (5.6), in combination with (5.8, 5.9),
solves the dynamic equations (4.26, 4.27). In order to do so we first calculate
the building blocks of the Green's function $\tilde A[x,y;x',y']$, giving (after a
modest amount of bookkeeping):

$$J_1[x,y]=xV_1(y)+V_2(y), \qquad \tilde J_1[x,y]=x\tilde V_1(y)+\tilde V_2(y), \qquad J_2[x,y]=xW_1(y)+W_2(y)$$

with the six functions

$$V_1(y)=\frac{\sigma^2(y)\Delta^2(y)+Q(1-q)[\sigma^2(y)-\Delta^2(y)]}{Q(1-q)\,\sigma^2(y)\Delta^2(y)}, \qquad V_2(y)=\frac{\bar x(y)\,Q(1-q)[\Delta^2(y)-\sigma^2(y)]-Ry\,\sigma^2(y)\Delta^2(y)}{Q(1-q)\,\sigma^2(y)\Delta^2(y)}$$

$$\tilde V_1(y)=\frac{[\sigma^2(y)-\Delta^2(y)]\,[Q(1-q)-\sigma^2(y)]}{Q(1-q)\,\sigma^2(y)\Delta^2(y)}, \qquad \tilde V_2(y)=\frac{\bar x(y)[\Delta^2(y)-\sigma^2(y)]\,[Q(1-q)-\sigma^2(y)]+(\bar x(y)-Ry)\,\Delta^2(y)\sigma^2(y)}{Q(1-q)\,\sigma^2(y)\Delta^2(y)}$$

$$W_1(y)=\frac{-R\,\sigma^4(y)}{Q(1-q)\,\sigma^2(y)\Delta^2(y)}, \qquad W_2(y)=\frac{y\,\sigma^2(y)\Delta^2(y)\,Q(1-q)+R\,\bar x(y)\,\sigma^4(y)}{Q(1-q)\,\sigma^2(y)\Delta^2(y)}$$

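Insertion of these expressions into the Green's function (4.32) subsequently
gives the simple expression

$$\tilde A[x,y;x',y']=xx'\,U_1(y,y')+x\,U_2(y,y')+x'\,U_2(y',y)+U_3(y,y') \qquad (5.10)$$

with the three kernels

$$\begin{aligned}U_1(y,y')=\;&W_1(y)W_1(y')+Q(1-q)\left[V_1(y)V_1(y')-\tilde V_1(y)\tilde V_1(y')\right]+Qq\left[V_1(y)-\tilde V_1(y)\right]\left[V_1(y')-\tilde V_1(y')\right]\\ &+R\left[V_1(y)-\tilde V_1(y)\right]W_1(y')+R\,W_1(y)\left[V_1(y')-\tilde V_1(y')\right]\end{aligned} \qquad (5.11)$$

$$\begin{aligned}U_2(y,y')=\;&W_1(y)W_2(y')+Q(1-q)\left[V_1(y)V_2(y')-\tilde V_1(y)\tilde V_2(y')\right]+Qq\left[V_1(y)-\tilde V_1(y)\right]\left[V_2(y')-\tilde V_2(y')\right]\\ &+R\left[V_1(y)-\tilde V_1(y)\right]W_2(y')+R\,W_1(y)\left[V_2(y')-\tilde V_2(y')\right]\end{aligned} \qquad (5.12)$$

$$\begin{aligned}U_3(y,y')=\;&W_2(y)W_2(y')+Q(1-q)\left[V_2(y)V_2(y')-\tilde V_2(y)\tilde V_2(y')\right]+Qq\left[V_2(y)-\tilde V_2(y)\right]\left[V_2(y')-\tilde V_2(y')\right]\\ &+R\left[V_2(y)-\tilde V_2(y)\right]W_2(y')+R\,W_2(y)\left[V_2(y')-\tilde V_2(y')\right]\end{aligned} \qquad (5.13)$$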

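Insertion of the above expression (5.10) for the Green's function into the
right-hand side of our diffusion equation (4.27) gives, in turn:

$$\begin{aligned}\frac{\Delta^2(y)\,{\rm RHS}}{P[x|y]}=\;&\frac{L}{2}\left\{\frac{[x-\bar x(y)]^2}{\Delta^2(y)}-1\right\}+\eta[x-\bar x(y)]\left\{\frac1\alpha\mathcal{G}[x,y]\right.\\ &+[x-\bar x(y)]\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_1(y,y')+U_2(y,y')]\\ &+\bar x(y)\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_1(y,y')+U_2(y,y')]\\ &\left.+\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_2(y',y)+U_3(y,y')]\right\}\\ &-\eta\Delta^2(y)\left\{\frac1\alpha\frac{\partial}{\partial x}\mathcal{G}[x,y]+\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_1(y,y')+U_2(y,y')]\right\}\end{aligned}$$

For the left-hand side of the diffusion equation for $P[x|y]$ we get

$$\frac{\Delta^2(y)\,{\rm LHS}}{P[x|y]}=\frac{[x-\bar x(y)]^2}{\Delta(y)}\frac{d}{dt}\Delta(y)+[x-\bar x(y)]\frac{d}{dt}\bar x(y)-\Delta(y)\frac{d}{dt}\Delta(y)$$

Since LHS is a second-order polynomial in $x$, we have to restrict ourselves to
functions $\mathcal{G}[x,y]$ which are first-order polynomials in $x$, hence the definition
(5.6). We equate the three monomials of $x-\bar x(y)$ and find for self-consistent
solutions of the locally Gaussian type (5.6) the following conditions:

$$[x-\bar x]^2:\qquad \frac{d}{dt}\Delta(y)=\frac{L}{2\Delta(y)}+\eta\Delta(y)\left\{\frac1\alpha\mathcal{G}_1(y)+\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_1(y,y')+U_2(y,y')]\right\}$$

$$\begin{aligned}[x-\bar x]^1:\qquad \frac{d}{dt}\bar x(y)=\eta\left\{\frac1\alpha[\mathcal{G}_0(y)+\bar x(y)\mathcal{G}_1(y)]\right.&+\bar x(y)\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_1(y,y')+U_2(y,y')]\\ &\left.+\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_2(y',y)+U_3(y,y')]\right\}\end{aligned}$$

$$[x-\bar x]^0:\qquad \frac{d}{dt}\Delta(y)=\frac{L}{2\Delta(y)}+\eta\Delta(y)\left\{\frac1\alpha\mathcal{G}_1(y)+\int\!dx'Dy'\;P[x'|y']\,\mathcal{G}[x',y']\,[x'U_1(y,y')+U_2(y,y')]\right\}$$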



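This is an important self-consistency test, since we have three equations for
only two functions ($\bar x(y)$ and $\Delta(y)$). Two of the three equations are found to
be identical. In the remaining two equations we insert the form of the learning
rules (5.6) and perform the $x$-integrations, giving

$$\begin{aligned}\frac{d}{dt}\Delta(y)=\frac{L}{2\Delta(y)}+\eta\Delta(y)\left\{\frac1\alpha\mathcal{G}_1(y)\right.&+\int\!Dy'\;\mathcal{G}_0(y')\,[\bar x(y')U_1(y,y')+U_2(y,y')]\\ &\left.+\int\!Dy'\;\mathcal{G}_1(y')\left[[\bar x^2(y')+\Delta^2(y')]U_1(y,y')+\bar x(y')U_2(y,y')\right]\right\}\end{aligned} \qquad (5.14)$$

$$\begin{aligned}\frac{d}{dt}\bar x(y)=\eta\left\{\frac1\alpha[\mathcal{G}_0(y)+\bar x(y)\mathcal{G}_1(y)]\right.&+\bar x(y)\int\!Dy'\;\mathcal{G}_0(y')\,[\bar x(y')U_1(y,y')+U_2(y,y')]\\ &+\bar x(y)\int\!Dy'\;\mathcal{G}_1(y')\left[[\bar x^2(y')+\Delta^2(y')]U_1(y,y')+\bar x(y')U_2(y,y')\right]\\ &+\int\!Dy'\;\mathcal{G}_0(y')\,[\bar x(y')U_2(y',y)+U_3(y,y')]\\ &\left.+\int\!Dy'\;\mathcal{G}_1(y')\left[[\bar x^2(y')+\Delta^2(y')]U_2(y',y)+\bar x(y')U_3(y,y')\right]\right\}\end{aligned} \qquad (5.15)$$

Our result is a solution in the form of coupled equations (4.26, 5.6, 5.9, 5.14,
5.15), without a functional saddle-point problem and without a diffusion equation.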
5.3 Explicit Example: Hebbian Learning



The simplest non-trivial member of the family (5.6) of learning rules with
locally Gaussian field distributions is the Hebb rule: $\mathcal{G}[x,y]={\rm sgn}[y]$ (i.e.
$\mathcal{G}_0(y)={\rm sgn}(y)$ and $\mathcal{G}_1(y)=0$). This example we will work out in full, as
an explicit illustration and in order to have a precise test (since only for this
rule can one calculate the macroscopic observables exactly and directly from
the microscopic laws). The remaining integrals in our equations (4.26, 5.6,
5.9, 5.14, 5.15) can be carried out explicitly. We find (for initial states $\mathbf{J}(0)$
which are not correlated with the questions in the training set):



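$$R=R_0+\eta t\sqrt{\frac{2}{\pi}}, \qquad Q=Q_0+\eta t\left[\kappa\eta+2R_0\sqrt{\frac{2}{\pi}}\right]+\eta^2t^2\left[\frac{2}{\pi}+\frac{1}{\alpha}\right] \qquad (5.16)$$

(with $\kappa=1$ for on-line learning and $\kappa=0$ for batch learning), and

$$\bar x(y)=Ry+\alpha^{-1}\eta t\;{\rm sgn}(y) \qquad (5.17)$$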
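As a quick consistency check, the closed forms for $R(t)$ and $Q(t)$ can be differentiated numerically and compared with the flow equations (4.26) for the Hebb rule. The sketch below assumes the forms $R=R_0+\eta t\sqrt{2/\pi}$ and $Q=Q_0+\eta t[\kappa\eta+2R_0\sqrt{2/\pi}]+\eta^2t^2[2/\pi+1/\alpha]$, and that $L=\kappa\eta^2$ for this rule; the function names are illustrative.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def hebb_R(t, eta, R0):
    """Teacher-student overlap R(t) for Hebbian learning."""
    return R0 + eta * t * SQRT_2_OVER_PI

def hebb_Q(t, eta, alpha, Q0, R0, kappa):
    """Squared student norm Q(t); kappa = 1 (on-line) or 0 (batch)."""
    return (Q0 + eta * t * (kappa * eta + 2.0 * R0 * SQRT_2_OVER_PI)
            + eta ** 2 * t ** 2 * (2.0 / math.pi + 1.0 / alpha))

def hebb_mean_field(y, t, eta, alpha, R):
    """Conditional mean of the student field, R y + (eta t / alpha) sgn(y)."""
    sgn = (y > 0) - (y < 0)
    return R * y + eta * t * sgn / alpha
```

With $\mathcal{G}={\rm sgn}(y)$, equation (4.26) gives $dQ/dt=2\eta\int Dy\,\bar x(y)\,{\rm sgn}(y)+L=2\eta[R\sqrt{2/\pi}+\eta t/\alpha]+\kappa\eta^2$, which is what a finite-difference derivative of `hebb_Q` should reproduce.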

To find the width $\Delta$ of $P[x|y]$ (which is found to remain independent of $y$, if
so at $t=0$), we have to solve the coupled equations (5.9, 5.14), which for the
Hebb rule reduce to



1 d . 9 
-—A 

n at 



Vt Q(l-q) 
a qQ — R 2 



2+2 



(i^Q)(--lW+^ 

a J a z 



Q0-q) 



\ 



1 + 



Q 2 (l-q) 2 
Q(l-q) 



2(qQ~R 2 ) 



A{qQ-R?)& 

\ Q 2 (i-q) 2 

(5.18) 



The solution of these equations is given by

$$\Delta^2=Q-R^2, \qquad q=Q^{-1}\left[R^2+\alpha^{-1}\eta^2t^2\right]$$
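That this pair solves the saddle-point equation for $q$ can be confirmed numerically. The sketch below assumes the equation in the form $\int Dy[\Delta^2+\bar x^2-2Ry\bar x]-qQ(\alpha^{-1}-1)+R^2\alpha^{-1}=[1+2(qQ-R^2)/Q(1-q)]\,\sigma^2$ with a $y$-independent width, and uses $\int Dy\,y^2=1$ and $\int Dy\,|y|=\sqrt{2/\pi}$; the function name is illustrative.

```python
import math

def check_hebb_saddle(t, eta, alpha, Q0=1.0, R0=0.0, kappa=1.0):
    """Return (lhs, rhs, sigma^2, Q(1-q)) of the q saddle-point equation for the Hebb rule."""
    s = math.sqrt(2.0 / math.pi)
    R = R0 + eta * t * s
    Q = (Q0 + eta * t * (kappa * eta + 2.0 * R0 * s)
         + (eta * t) ** 2 * (2.0 / math.pi + 1.0 / alpha))
    c = eta * t / alpha                      # shift in x(y) = R y + c sgn(y)
    Delta2 = Q - R * R                       # claimed conditional variance
    q = (R * R + (eta * t) ** 2 / alpha) / Q # claimed overlap parameter
    B2 = (q * Q - R * R) / (Q * (1.0 - q)) ** 2
    sigma2 = (math.sqrt(1.0 + 4.0 * B2 * Delta2) - 1.0) / (2.0 * B2)
    mean_x2 = R * R + 2.0 * R * c * s + c * c   # Gaussian average of x(y)^2
    mean_xy = R + c * s                         # Gaussian average of y x(y)
    lhs = (Delta2 + mean_x2 - 2.0 * R * mean_xy
           - q * Q * (1.0 / alpha - 1.0) + R * R / alpha)
    rhs = (1.0 + 2.0 * (q * Q - R * R) / (Q * (1.0 - q))) * sigma2
    return lhs, rhs, sigma2, Q * (1.0 - q)
```

For $t>0$ the two sides should coincide, and the measure width comes out as $\sigma^2=Q(1-q)$. The check requires $Q_0>R_0^2$ so that $q<1$.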

This solution is unique (for a proof see Coolen and Saad, 1998). Thus:

$$P[x|y]=\frac{e^{-\frac12[x-Ry-\alpha^{-1}\eta t\,{\rm sgn}(y)]^2/(Q-R^2)}}{\sqrt{2\pi(Q-R^2)}} \qquad (5.19)$$

We can now also calculate both errors as functions of time:



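$$E_{\rm g}=\frac1\pi\arccos\left[\frac{R}{\sqrt{Q}}\right], \qquad E_{\rm t}=\frac12-\frac12\int\!Dy\;{\rm erf}\left[\frac{|y|R+\eta t/\alpha}{\sqrt{2(Q-R^2)}}\right] \qquad (5.20)$$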
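The two error formulas are easy to evaluate numerically. The following sketch assumes $E_{\rm g}=\pi^{-1}\arccos[R/\sqrt{Q}]$ and $E_{\rm t}=\frac12-\frac12\int Dy\,{\rm erf}[(|y|R+\eta t/\alpha)/\sqrt{2(Q-R^2)}]$ together with the Hebbian expressions for $Q(t)$ and $R(t)$; the Gaussian average is approximated with a composite Simpson rule, and all function names are illustrative.

```python
import math

def gauss_average(f, y_max=8.0, n=4000):
    """Composite Simpson approximation of the Gaussian average of f(y); n must be even."""
    h = 2.0 * y_max / n
    total = 0.0
    for i in range(n + 1):
        y = -y_max + i * h
        w = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        total += w * f(y) * math.exp(-0.5 * y * y)
    return total * h / (3.0 * math.sqrt(2.0 * math.pi))

def hebb_errors(t, eta=1.0, alpha=1.0, Q0=1.0, R0=0.0, kappa=1.0):
    """Generalization and training error (E_g, E_t) at time t for Hebbian learning."""
    s = math.sqrt(2.0 / math.pi)
    R = R0 + eta * t * s
    Q = (Q0 + eta * t * (kappa * eta + 2.0 * R0 * s)
         + (eta * t) ** 2 * (2.0 / math.pi + 1.0 / alpha))
    E_g = math.acos(R / math.sqrt(Q)) / math.pi
    width = math.sqrt(2.0 * (Q - R * R))
    E_t = 0.5 - 0.5 * gauss_average(
        lambda y: math.erf((abs(y) * R + eta * t / alpha) / width))
    return E_g, E_t
```

Starting from $Q_0=1$, $R_0=0$ both errors equal $\frac12$ at $t=0$; at later times the training error lies below the generalization error, because the shift $\eta t/\alpha$ biases the student field towards the correct sign on the training examples.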



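Asymptotically one finds $\lim_{t\to\infty}q=1$, and

$$\lim_{t\to\infty}E_{\rm g}=\frac1\pi\arccos\left[\sqrt{\frac{2\alpha}{2\alpha+\pi}}\right], \qquad \lim_{t\to\infty}E_{\rm t}=\frac12-\frac12\int\!Dy\;{\rm erf}\left[|y|\sqrt{\frac{\alpha}{\pi}}+\frac{1}{\sqrt{2\alpha}}\right]$$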
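The asymptotic errors can be tabulated as functions of $\alpha$ with a few lines of code. This is a sketch assuming the limits $E_{\rm g}\to\pi^{-1}\arccos\sqrt{2\alpha/(2\alpha+\pi)}$ and $E_{\rm t}\to\frac12-\frac12\int Dy\,{\rm erf}[|y|\sqrt{\alpha/\pi}+1/\sqrt{2\alpha}]$; the quadrature helper and the function names are illustrative.

```python
import math

def gauss_average(f, y_max=8.0, n=4000):
    """Composite Simpson approximation of the Gaussian average of f(y); n must be even."""
    h = 2.0 * y_max / n
    total = 0.0
    for i in range(n + 1):
        y = -y_max + i * h
        w = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        total += w * f(y) * math.exp(-0.5 * y * y)
    return total * h / (3.0 * math.sqrt(2.0 * math.pi))

def asymptotic_errors(alpha):
    """Stationary (E_g, E_t) for Hebbian learning at training set size alpha = p/N."""
    E_g = math.acos(math.sqrt(2.0 * alpha / (2.0 * alpha + math.pi))) / math.pi
    E_t = 0.5 - 0.5 * gauss_average(
        lambda y: math.erf(abs(y) * math.sqrt(alpha / math.pi)
                           + 1.0 / math.sqrt(2.0 * alpha)))
    return E_g, E_t
```

Without the term $1/\sqrt{2\alpha}$ in the argument of the error function the two expressions would coincide, so the gap between training and generalization error is controlled by that term and closes as $\alpha\to\infty$.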
Fig. 4: Asymptotic values of the two errors, i.e. $\lim_{t\to\infty}E_{\rm g}$ and
$\lim_{t\to\infty}E_{\rm t}$, for Hebbian learning (both on-line and batch), as functions of the
relative size $\alpha=p/N$ of the training set.



Both asymptotic errors are independent of $\kappa$, i.e. of whether batch or on-line
learning is used. These results are depicted in Fig. 4.

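A final object to be calculated is the student field distribution $P(x)$ itself,
via $P(x)=\int\!dy\,P[x,y]=\int\!Dy\,P[x|y]$. This gives

$$P(x)=\frac{e^{-\frac12[x+\eta t/\alpha]^2/Q}}{2\sqrt{2\pi Q}}\left[1-{\rm erf}\left(\frac{R[x+\eta t/\alpha]}{\sqrt{2Q(Q-R^2)}}\right)\right]+\frac{e^{-\frac12[x-\eta t/\alpha]^2/Q}}{2\sqrt{2\pi Q}}\left[1+{\rm erf}\left(\frac{R[x-\eta t/\alpha]}{\sqrt{2Q(Q-R^2)}}\right)\right] \qquad (5.21)$$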
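The closed form for $P(x)$ can be compared with a direct numerical evaluation of $\int Dy\,P[x|y]$. The sketch below assumes the conditional distribution is Gaussian with mean $Ry+c\,{\rm sgn}(y)$, $c=\eta t/\alpha$, and variance $Q-R^2$, and that $P(x)$ is the two-term expression with error functions given in the text; all function names are illustrative.

```python
import math

def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0

def student_field_density(x, Q, R, c):
    """Closed form for P(x); c = eta*t/alpha is the shift of the conditional mean."""
    norm = 2.0 * math.sqrt(2.0 * math.pi * Q)
    scale = math.sqrt(2.0 * Q * (Q - R * R))
    left = math.exp(-0.5 * (x + c) ** 2 / Q) * (1.0 - math.erf(R * (x + c) / scale))
    right = math.exp(-0.5 * (x - c) ** 2 / Q) * (1.0 + math.erf(R * (x - c) / scale))
    return (left + right) / norm

def direct_density(x, Q, R, c, y_max=10.0):
    """Direct quadrature of the Gaussian average of P[x|y], split at the jump y = 0."""
    var = Q - R * R
    def integrand(y, s):
        weight = math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)
        gauss = math.exp(-0.5 * (x - R * y - s * c) ** 2 / var)
        return weight * gauss / math.sqrt(2.0 * math.pi * var)
    neg = simpson(lambda y: integrand(y, -1.0), -y_max, 0.0)
    pos = simpson(lambda y: integrand(y, +1.0), 0.0, y_max)
    return neg + pos
```

The integration range is split at $y=0$ because ${\rm sgn}(y)$ makes the integrand discontinuous there; integrating `student_field_density` over $x$ should also return one.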



5.4 Comparison with Exact Results and Simulations 



Only for the (simple) Hebbian rule can our dynamic order parameters in fact
be calculated directly from the microscopic learning rules, even for finite $\alpha$,
which provides an excellent benchmark for candidate general theories. Exact
evaluation for on-line learning is found to give (Rae et al, 1998):

$$R=R_0+\eta t\sqrt{\frac{2}{\pi}}, \qquad Q=Q_0+2\eta tR_0\sqrt{\frac{2}{\pi}}+\eta^2t+\eta^2t^2\left[\frac{1}{\alpha}+\frac{2}{\pi}\right] \qquad (5.22)$$

$$P[x|y]=\int\!\frac{d\hat x}{2\pi}\; e^{-\frac12\hat x^2[Q-R^2]+i\hat x[x-yR]+\frac{t}{\alpha}\left[e^{-i\eta\hat x\,{\rm sgn}(y)}-1\right]} \qquad (5.23)$$
Fig. 5: Comparison between simulation results for on-line Hebbian
learning (with system size $N=10{,}000$) and dynamical replica theory, for
$\alpha\in\{0.25,0.5,1.0,2.0,4.0\}$. Upper five curves: generalization errors as functions
of time. Lower five curves: training errors as functions of time. Circles:
simulation results for generalization errors; diamonds: simulation results for
training errors. Solid lines: corresponding theoretical predictions.



Comparison with (5.16, 5.19) shows that our theory gives the exact expressions
for $Q$ and $R$ (and thus for $E_{\rm g}$), but an approximation for $P[x|y]$ (and thus
for $E_{\rm t}$) as soon as both $\alpha$ and $t$ are finite. The most transparent comparison
is made in terms of the Fourier transform $\hat P[k|y]=\int\!dx\;e^{-ikx}P[x|y]$:

$$\hat P[k|y]_{\rm exact}=e^{-\frac12k^2(Q-R^2)-ikRy+\frac{t}{\alpha}\left[e^{-i\eta k\,{\rm sgn}(y)}-1\right]} \qquad (5.24)$$

$$\hat P[k|y]_{\rm DRT}=e^{-\frac12k^2(Q-R^2)-ikRy+\frac{t}{\alpha}\left[-i\eta k\,{\rm sgn}(y)\right]} \qquad (5.25)$$

One obtains $\hat P[k|y]_{\rm DRT}$ by retaining only the first two orders in the expansion
of the term $e^{-i\eta k\,{\rm sgn}(y)}$ in the exponent of $\hat P[k|y]_{\rm exact}$. The difference between
the exponents of (5.24) and (5.25) is in $\mathcal{O}(t)$ rather than $\mathcal{O}(t^2)$ terms. Since the
latter control the asymptotics, exactness of our theory is (for any $\alpha$) restored
for $t\to\infty$; thus the asymptotic errors shown in Fig. 4 are also exact.
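The relation between the exact and the approximate conditional distribution can be explored numerically through their Fourier transforms. This sketch assumes the two exponents differ only in the last term, $(t/\alpha)[e^{-i\eta k\,{\rm sgn}(y)}-1]$ versus its first-order expansion $(t/\alpha)[-i\eta k\,{\rm sgn}(y)]$; the function names are illustrative.

```python
import cmath

def sgn(y):
    """Sign function with sgn(0) = 0."""
    return (y > 0) - (y < 0)

def char_exact(k, y, t, eta, alpha, Q, R):
    """Fourier transform of the exact conditional field distribution."""
    s = sgn(y)
    return cmath.exp(-0.5 * k * k * (Q - R * R) - 1j * k * R * y
                     + (t / alpha) * (cmath.exp(-1j * eta * k * s) - 1.0))

def char_drt(k, y, t, eta, alpha, Q, R):
    """Fourier transform of the locally Gaussian prediction of the theory."""
    s = sgn(y)
    return cmath.exp(-0.5 * k * k * (Q - R * R) - 1j * k * R * y
                     - 1j * (t / alpha) * eta * k * s)
```

Since the exponents agree up to and including first order in $k$, both expressions are normalized and have the same conditional mean; the leading discrepancy is the term $-\frac12(t/\alpha)\eta^2k^2$, i.e. a difference in conditional variance that is linear in $t$.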

We conclude: either replica symmetry must be broken (RSB), or our set
of order parameters $\{Q,R,P\}$ does not yet obey closed deterministic and
self-averaging dynamical equations (otherwise our theory would have been
exact, by construction). The natural ways to try to improve our theory are thus
either to construct RSB solutions of the saddle-point equations, or to add
observables to the order parameter set, such as the Green's function.
Fig. 6: Comparison between simulation results for on-line Hebbian
learning (system size $N=10{,}000$) and dynamical replica theory, for $\alpha\in\{0.5,1.0,2.0\}$.
Dots: local fields $(x,y)=(\mathbf{J}\cdot\boldsymbol{\xi},\mathbf{B}\cdot\boldsymbol{\xi})$ (calculated for questions in
the training set), at time $t=50$. Dashed lines: conditional average of student
field $x$ as a function of $y$, as predicted by the theory, $\bar x(y)=Ry+(\eta t/\alpha)\,{\rm sgn}(y)$.
Fig. 7: Simulations of Hebbian on-line learning with $N=10{,}000$.
Histograms: student field distributions measured at $t=10$ and $t=20$. Lines:
theoretical predictions for student field distributions, $\alpha=4$ (upper), $\alpha=1$
(middle), $\alpha=0.25$ (lower).
In Fig. 5 we compare the predictions for the generalization and training
errors (5.20) of our theory with the results obtained from numerical simulations
for $N=10{,}000$ (initial state: $Q_0=1$, $R_0=0$; learning rate: $\eta=1$). The
curves for the generalization errors show full agreement between theory and
experiment, as guaranteed by (5.22). However, the transients of $E_{\rm t}$ show deviations,
which become more pronounced with decreasing $\alpha$, but which vanish
with increasing time. This trend immediately follows from (5.24, 5.25).



We also compare the theoretical predictions made for the distribution
$P[x|y]$ with the results of numerical simulations. This is done in Fig. 6, where
we show the fields as observed at time $t=50$ in simulations ($N=10{,}000$,
$\eta=1$, $R_0=0$, $Q_0=1$) of on-line Hebbian learning, for three different values
of $\alpha$. In the same figure we draw (as dashed lines) the theoretical prediction
for the $y$-dependent average of the conditional $x$-distribution $P[x|y]$:

$$\bar x(y)=Ry+{\rm sgn}(y)\,\eta t/\alpha$$

We observe that our expression (5.19) for $P[x|y]$ (a Gaussian distribution in
$x$, with $y$-dependent average $\bar x(y)$ as given above) indeed captures
the qualitative features of the $(x,y)$ statistics. Clearly, $P[x,y]$ is itself not a
joint Gaussian distribution.

Finally we compare the student field distribution $P(x)$, as observed in
simulations of on-line Hebbian learning ($N=10{,}000$, $\eta=1$, $R_0=0$, $Q_0=1$)
with our prediction (5.21). The result is shown in Fig. 7, for $\alpha\in\{4,1,0.25\}$.
The agreement is again quite satisfactory, as could have been expected.



6 Discussion 

In this paper we have shown how the formalism of dynamical replica theory
(e.g. Coolen et al, 1996) can be used successfully to build a general theory
with which to predict the evolution of the relevant macroscopic performance
measures for supervised (on-line and batch) learning in layered neural networks
with randomly composed but restricted training sets (i.e. for finite
$\alpha=p/N$), where the student fields are no longer described by Gaussian distributions,
and where the more traditional and familiar statistical mechanical
formalism consequently breaks down. For simplicity and transparency we
have restricted ourselves to single-layer systems and realizable tasks. In our
approach the joint field distribution $P[x,y]$ for student and teacher fields is
itself taken to be a dynamical order parameter, in addition to the more conventional
observables $Q$ and $R$; from this order parameter set $\{Q,R,P\}$, in
turn, immediately follow the generalization error $E_{\rm g}$ and the training error
$E_{\rm t}$. This then results, following the prescriptions of dynamical replica theory²,
in a diffusion equation for $P[x,y]$, which we have evaluated by making the
replica-symmetric ansatz in the saddle-point equations. This diffusion equation
is found to have Gaussian solutions only for $\alpha\to\infty$; in the latter case
we indeed recover correctly from our theory the more familiar formalism of
infinite training sets, with (in the $N\to\infty$ limit) closed equations for $Q$ and
$R$ only. For finite $\alpha$ our theory is by construction exact if for $N\to\infty$ the
dynamical order parameters $\{Q,R,P\}$ obey closed, deterministic equations,
which are self-averaging (i.e. independent of the microscopic realization of the
training set). If this is not the case, our theory is an approximation.

² The reason why replicas are inevitable (unless we are willing to pay the price of having
observables with two time arguments, and turn to path integrals) is the necessity for finite
$\alpha$ to average the macroscopic equations over all possible realizations of the training set.

We have worked out our general equations explicitly for the special case
of Hebbian learning, where the existence of exact results, derived directly
from the microscopic equations (even for finite $\alpha$), allows us to perform a
critical test of our theory³. Here we find that our theory does produce correct
predictions for the observables $Q$, $R$ and $E_{\rm g}$, but an approximation for $P[x,y]$
and $E_{\rm t}$ if $\alpha$ is finite (although the stationary state predicted is again correct).

The present study clearly represents only a first step, and many extensions, 
applications and generalizations can and should be made. To name but a few: 

(i) Application to Different Learning Rules 

So far our theory has only been applied to Hebbian learning, in view of its
special status as a rigorous benchmark (Rae et al, 1998). Further application
to non-Hebbian learning rules, subsequently to be tested via simulations, is
clearly called for. For rules of the type (5.6) the saddle-point equations are
still simple; in general one will have to solve functional saddle-point equations
at each instance of time, which is a non-trivial numerical exercise.

(ii) Application to Multi-Layer Networks 

Our theory generalizes in a natural way to multi-layer networks, provided the 
number of hidden neurons remains finite. However, as in the infinite training 
set formalism, the number of observables (and thus the number of saddle- 
point equations) will increase significantly. 

(iii) Further Analysis of Saddle-Point Equations

We still have to determine the uniqueness or otherwise of the solution of our 
functional saddle-point equation. More ambitious, but not ruled out, are our 
current attempts to solve the functional saddle-point equation explicitly. If 
this is impossible, one further alternative to numerical solution would be a 
variational approach (where upon choosing a restricted parametrized family 
of functions, functional extremization is replaced by ordinary extremization). 

(iv) Replica Symmetry Breaking

The observed deviations between the present theory and the exact calculation
for the Hebb rule could indicate RSB (although this is not likely, in view
of the asymptotic exactness of our RS equations). In the usual manner one
can calculate an equation for the AT instability, which would in our problem
define a surface in order parameter space, and determine the onset of RSB.

³ Note that such exact results can only be obtained for the relatively simple Hebbian
rules, where the dependence of the weight updates $\Delta\mathbf{J}(t)$ on the current weights $\mathbf{J}(t)$
is trivial or even absent (a decay term at most), whereas our present theory generates
macroscopic equations for arbitrary learning rules.

(v) Systematic Improvement via Higher Order Observables 

Dynamical replica theory allows for systematic improvement. By adding new
observables to the order parameter set (which cannot be expressed in terms
of those already present) the theory will, by construction, become more accurate.
A natural candidate for being added to the set $\{Q,R,P\}$ is the Green's
function $A[x,y;x',y']$ (3.6). This would change our problem to closure of a
dynamic equation for $A$, which would involve a higher order Green's function.

(vi) Generalization to Noisy Teachers 

Last but not least, one can generalize our theory to the case of noisy teachers.
This is a straightforward although tedious exercise, involving a field distribution
of the form $P[x,y,z]$ (describing student fields, fields of the 'perfect'
teacher, and fields of the 'noisy' teacher). It will, however, allow us to describe
over-fitting phenomena in terms of macroscopic dynamic equations.